diff options
Diffstat (limited to 'lib/tdb2/doc')
-rw-r--r-- | lib/tdb2/doc/TDB1_porting.txt | 44 | ||||
-rw-r--r-- | lib/tdb2/doc/design-1.3.txt | 1049 | ||||
-rw-r--r-- | lib/tdb2/doc/design.lyx | 2689 | ||||
-rw-r--r-- | lib/tdb2/doc/design.lyx,v | 4679 | ||||
-rw-r--r-- | lib/tdb2/doc/design.pdf | bin | 0 -> 240440 bytes | |||
-rw-r--r-- | lib/tdb2/doc/design.txt | 1258 |
6 files changed, 9719 insertions, 0 deletions
diff --git a/lib/tdb2/doc/TDB1_porting.txt b/lib/tdb2/doc/TDB1_porting.txt new file mode 100644 index 0000000000..90ba249738 --- /dev/null +++ b/lib/tdb2/doc/TDB1_porting.txt @@ -0,0 +1,44 @@ +Interface differences between TDB1 and TDB2. + +- tdb2 uses 'struct tdb_data', tdb1 uses 'struct TDB_DATA'. Use the + TDB_DATA typedef if you want portability between the two. + +- tdb2 functions return 0 on success, and a negative error on failure, + whereas tdb1 functions returned 0 on success, and -1 on failure. + tdb1 then used tdb_error() to determine the error; this is also + supported in tdb2 to ease backwards compatibility, though the other + form is preferred. + +- tdb2's tdb_fetch() returns an error, tdb1's returned the data directly + (or tdb_null, and you were supposed to check tdb_error() to find out why). + +- tdb2's tdb_nextkey() frees the old key's dptr, in tdb2 you needed to do + this manually. + +- tdb1's tdb_open/tdb_open_ex took an explicit hash size. tdb2's hash table + resizes as required. + +- tdb2 uses a linked list of attribute structures to implement logging and + alternate hashes. tdb1 used tdb_open_ex, which was not extensible. + +- tdb2 does locking on read-only databases (ie. O_RDONLY passed to tdb_open). + tdb1 did not: use the TDB_NOLOCK flag if you want to suppress locking. + +- tdb2's log function is simpler than tdb1's log function. The string is + already formatted, and it takes an enum tdb_log_level not a tdb_debug_level, + and which has only three values: TDB_LOG_ERROR, TDB_LOG_USE_ERROR and + TDB_LOG_WARNING. + +- tdb2 provides tdb_deq() for comparing two struct tdb_data. + +- tdb2's tdb_name() returns a copy of the name even for TDB_INTERNAL dbs. + +- tdb2 does not need tdb_reopen() or tdb_reopen_all(). If you call + fork() after during certain operations the child should close the + tdb, or complete the operations before continuing to use the tdb: + + tdb_transaction_start(): child must tdb_transaction_cancel() + tdb_lockall(): child must call tdb_unlockall() + tdb_lockall_read(): child must call tdb_unlockall_read() + tdb_chainlock(): child must call tdb_chainunlock() + tdb_parse() callback: child must return from tdb_parse() diff --git a/lib/tdb2/doc/design-1.3.txt b/lib/tdb2/doc/design-1.3.txt new file mode 100644 index 0000000000..f81ecf7885 --- /dev/null +++ b/lib/tdb2/doc/design-1.3.txt @@ -0,0 +1,1049 @@ +TDB2: A Redesigning The Trivial DataBase + +Rusty Russell, IBM Corporation + +27-April-2010 + +Abstract + +The Trivial DataBase on-disk format is 32 bits; with usage cases +heading towards the 4G limit, that must change. This required +breakage provides an opportunity to revisit TDB's other design +decisions and reassess them. + +1 Introduction + +The Trivial DataBase was originally written by Andrew Tridgell as +a simple key/data pair storage system with the same API as dbm, +but allowing multiple readers and writers while being small +enough (< 1000 lines of C) to include in SAMBA. The simple design +created in 1999 has proven surprisingly robust and performant, +used in Samba versions 3 and 4 as well as numerous other +projects. Its useful life was greatly increased by the +(backwards-compatible!) addition of transaction support in 2005. + +The wider variety and greater demands of TDB-using code has lead +to some organic growth of the API, as well as some compromises on +the implementation. None of these, by themselves, are seen as +show-stoppers, but the cumulative effect is to a loss of elegance +over the initial, simple TDB implementation. Here is a table of +the approximate number of lines of implementation code and number +of API functions at the end of each year: + + ++-----------+----------------+--------------------------------+ +| Year End | API Functions | Lines of C Code Implementation | ++-----------+----------------+--------------------------------+ ++-----------+----------------+--------------------------------+ +| 1999 | 13 | 1195 | ++-----------+----------------+--------------------------------+ +| 2000 | 24 | 1725 | ++-----------+----------------+--------------------------------+ +| 2001 | 32 | 2228 | ++-----------+----------------+--------------------------------+ +| 2002 | 35 | 2481 | ++-----------+----------------+--------------------------------+ +| 2003 | 35 | 2552 | ++-----------+----------------+--------------------------------+ +| 2004 | 40 | 2584 | ++-----------+----------------+--------------------------------+ +| 2005 | 38 | 2647 | ++-----------+----------------+--------------------------------+ +| 2006 | 52 | 3754 | ++-----------+----------------+--------------------------------+ +| 2007 | 66 | 4398 | ++-----------+----------------+--------------------------------+ +| 2008 | 71 | 4768 | ++-----------+----------------+--------------------------------+ +| 2009 | 73 | 5715 | ++-----------+----------------+--------------------------------+ + + +This review is an attempt to catalog and address all the known +issues with TDB and create solutions which address the problems +without significantly increasing complexity; all involved are far +too aware of the dangers of second system syndrome in rewriting a +successful project like this. + +2 API Issues + +2.1 tdb_open_ex Is Not Expandable + +The tdb_open() call was expanded to tdb_open_ex(), which added an +optional hashing function and an optional logging function +argument. Additional arguments to open would require the +introduction of a tdb_open_ex2 call etc. + +2.1.1 Proposed Solution + +tdb_open() will take a linked-list of attributes: + +enum tdb_attribute { + + TDB_ATTRIBUTE_LOG = 0, + + TDB_ATTRIBUTE_HASH = 1 + +}; + +struct tdb_attribute_base { + + enum tdb_attribute attr; + + union tdb_attribute *next; + +}; + +struct tdb_attribute_log { + + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG +*/ + + tdb_log_func log_fn; + + void *log_private; + +}; + +struct tdb_attribute_hash { + + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH +*/ + + tdb_hash_func hash_fn; + + void *hash_private; + +}; + +union tdb_attribute { + + struct tdb_attribute_base base; + + struct tdb_attribute_log log; + + struct tdb_attribute_hash hash; + +}; + +This allows future attributes to be added, even if this expands +the size of the union. + +2.2 tdb_traverse Makes Impossible Guarantees + +tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, +and it was thought that it was important to guarantee that all +records which exist at the start and end of the traversal would +be included, and no record would be included twice. + +This adds complexity (see[Reliable-Traversal-Adds]) and does not +work anyway for records which are altered (in particular, those +which are expanded may be effectively deleted and re-added behind +the traversal). + +2.2.1 <traverse-Proposed-Solution>Proposed Solution + +Abandon the guarantee. You will see every record if no changes +occur during your traversal, otherwise you will see some subset. +You can prevent changes by using a transaction or the locking +API. + +2.3 Nesting of Transactions Is Fraught + +TDB has alternated between allowing nested transactions and not +allowing them. Various paths in the Samba codebase assume that +transactions will nest, and in a sense they can: the operation is +only committed to disk when the outer transaction is committed. +There are two problems, however: + +1. Canceling the inner transaction will cause the outer + transaction commit to fail, and will not undo any operations + since the inner transaction began. This problem is soluble with + some additional internal code. + +2. An inner transaction commit can be cancelled by the outer + transaction. This is desirable in the way which Samba's + database initialization code uses transactions, but could be a + surprise to any users expecting a successful transaction commit + to expose changes to others. + +The current solution is to specify the behavior at tdb_open(), +with the default currently that nested transactions are allowed. +This flag can also be changed at runtime. + +2.3.1 Proposed Solution + +Given the usage patterns, it seems that the “least-surprise” +behavior of disallowing nested transactions should become the +default. Additionally, it seems the outer transaction is the only +code which knows whether inner transactions should be allowed, so +a flag to indicate this could be added to tdb_transaction_start. +However, this behavior can be simulated with a wrapper which uses +tdb_add_flags() and tdb_remove_flags(), so the API should not be +expanded for this relatively-obscure case. + +2.4 Incorrect Hash Function is Not Detected + +tdb_open_ex() allows the calling code to specify a different hash +function to use, but does not check that all other processes +accessing this tdb are using the same hash function. The result +is that records are missing from tdb_fetch(). + +2.4.1 Proposed Solution + +The header should contain an example hash result (eg. the hash of +0xdeadbeef), and tdb_open_ex() should check that the given hash +function produces the same answer, or fail the tdb_open call. + +2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation + +In response to scalability issues with the free list ([TDB-Freelist-Is] +) two API workarounds have been incorporated in TDB: +tdb_set_max_dead() and the TDB_VOLATILE flag to tdb_open. The +latter actually calls the former with an argument of “5”. + +This code allows deleted records to accumulate without putting +them in the free list. On delete we iterate through each chain +and free them in a batch if there are more than max_dead entries. +These are never otherwise recycled except as a side-effect of a +tdb_repack. + +2.5.1 Proposed Solution + +With the scalability problems of the freelist solved, this API +can be removed. The TDB_VOLATILE flag may still be useful as a +hint that store and delete of records will be at least as common +as fetch in order to allow some internal tuning, but initially +will become a no-op. + +2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times + In The Same Process + +No process can open the same TDB twice; we check and disallow it. +This is an unfortunate side-effect of fcntl locks, which operate +on a per-file rather than per-file-descriptor basis, and do not +nest. Thus, closing any file descriptor on a file clears all the +locks obtained by this process, even if they were placed using a +different file descriptor! + +Note that even if this were solved, deadlock could occur if +operations were nested: this is a more manageable programming +error in most cases. + +2.6.1 Proposed Solution + +We could lobby POSIX to fix the perverse rules, or at least lobby +Linux to violate them so that the most common implementation does +not have this restriction. This would be a generally good idea +for other fcntl lock users. + +Samba uses a wrapper which hands out the same tdb_context to +multiple callers if this happens, and does simple reference +counting. We should do this inside the tdb library, which already +emulates lock nesting internally; it would need to recognize when +deadlock occurs within a single process. This would create a new +failure mode for tdb operations (while we currently handle +locking failures, they are impossible in normal use and a process +encountering them can do little but give up). + +I do not see benefit in an additional tdb_open flag to indicate +whether re-opening is allowed, as though there may be some +benefit to adding a call to detect when a tdb_context is shared, +to allow other to create such an API. + +2.7 TDB API Is Not POSIX Thread-safe + +The TDB API uses an error code which can be queried after an +operation to determine what went wrong. This programming model +does not work with threads, unless specific additional guarantees +are given by the implementation. In addition, even +otherwise-independent threads cannot open the same TDB (as in [TDB-Files-Cannot] +). + +2.7.1 Proposed Solution + +Reachitecting the API to include a tdb_errcode pointer would be a +great deal of churn; we are better to guarantee that the +tdb_errcode is per-thread so the current programming model can be +maintained. + +This requires dynamic per-thread allocations, which is awkward +with POSIX threads (pthread_key_create space is limited and we +cannot simply allocate a key for every TDB). + +Internal locking is required to make sure that fcntl locks do not +overlap between threads, and also that the global list of tdbs is +maintained. + +The aim is that building tdb with -DTDB_PTHREAD will result in a +pthread-safe version of the library, and otherwise no overhead +will exist. + +2.8 *_nonblock Functions And *_mark Functions Expose + Implementation + +CTDB[footnote: +Clustered TDB, see http://ctdb.samba.org +] wishes to operate on TDB in a non-blocking manner. This is +currently done as follows: + +1. Call the _nonblock variant of an API function (eg. + tdb_lockall_nonblock). If this fails: + +2. Fork a child process, and wait for it to call the normal + variant (eg. tdb_lockall). + +3. If the child succeeds, call the _mark variant to indicate we + already have the locks (eg. tdb_lockall_mark). + +4. Upon completion, tell the child to release the locks (eg. + tdb_unlockall). + +5. Indicate to tdb that it should consider the locks removed (eg. + tdb_unlockall_mark). + +There are several issues with this approach. Firstly, adding two +new variants of each function clutters the API for an obscure +use, and so not all functions have three variants. Secondly, it +assumes that all paths of the functions ask for the same locks, +otherwise the parent process will have to get a lock which the +child doesn't have under some circumstances. I don't believe this +is currently the case, but it constrains the implementation. + +2.8.1 <Proposed-Solution-locking-hook>Proposed Solution + +Implement a hook for locking methods, so that the caller can +control the calls to create and remove fcntl locks. In this +scenario, ctdbd would operate as follows: + +1. Call the normal API function, eg tdb_lockall(). + +2. When the lock callback comes in, check if the child has the + lock. Initially, this is always false. If so, return 0. + Otherwise, try to obtain it in non-blocking mode. If that + fails, return EWOULDBLOCK. + +3. Release locks in the unlock callback as normal. + +4. If tdb_lockall() fails, see if we recorded a lock failure; if + so, call the child to repeat the operation. + +5. The child records what locks it obtains, and returns that + information to the parent. + +6. When the child has succeeded, goto 1. + +This is flexible enough to handle any potential locking scenario, +even when lock requirements change. It can be optimized so that +the parent does not release locks, just tells the child which +locks it doesn't need to obtain. + +It also keeps the complexity out of the API, and in ctdbd where +it is needed. + +2.9 tdb_chainlock Functions Expose Implementation + +tdb_chainlock locks some number of records, including the record +indicated by the given key. This gave atomicity guarantees; +no-one can start a transaction, alter, read or delete that key +while the lock is held. + +It also makes the same guarantee for any other key in the chain, +which is an internal implementation detail and potentially a +cause for deadlock. + +2.9.1 Proposed Solution + +None. It would be nice to have an explicit single entry lock +which effected no other keys. Unfortunately, this won't work for +an entry which doesn't exist. Thus while chainlock may be +implemented more efficiently for the existing case, it will still +have overlap issues with the non-existing case. So it is best to +keep the current (lack of) guarantee about which records will be +effected to avoid constraining our implementation. + +2.10 Signal Handling is Not Race-Free + +The tdb_setalarm_sigptr() call allows the caller's signal handler +to indicate that the tdb locking code should return with a +failure, rather than trying again when a signal is received (and +errno == EAGAIN). This is usually used to implement timeouts. + +Unfortunately, this does not work in the case where the signal is +received before the tdb code enters the fcntl() call to place the +lock: the code will sleep within the fcntl() code, unaware that +the signal wants it to exit. In the case of long timeouts, this +does not happen in practice. + +2.10.1 Proposed Solution + +The locking hooks proposed in[Proposed-Solution-locking-hook] +would allow the user to decide on whether to fail the lock +acquisition on a signal. This allows the caller to choose their +own compromise: they could narrow the race by checking +immediately before the fcntl call.[footnote: +It may be possible to make this race-free in some implementations +by having the signal handler alter the struct flock to make it +invalid. This will cause the fcntl() lock call to fail with +EINVAL if the signal occurs before the kernel is entered, +otherwise EAGAIN. +] + +2.11 The API Uses Gratuitous Typedefs, Capitals + +typedefs are useful for providing source compatibility when types +can differ across implementations, or arguably in the case of +function pointer definitions which are hard for humans to parse. +Otherwise it is simply obfuscation and pollutes the namespace. + +Capitalization is usually reserved for compile-time constants and +macros. + + TDB_CONTEXT There is no reason to use this over 'struct + tdb_context'; the definition isn't visible to the API user + anyway. + + TDB_DATA There is no reason to use this over struct TDB_DATA; + the struct needs to be understood by the API user. + + struct TDB_DATA This would normally be called 'struct + tdb_data'. + + enum TDB_ERROR Similarly, this would normally be enum + tdb_error. + +2.11.1 Proposed Solution + +None. Introducing lower case variants would please pedants like +myself, but if it were done the existing ones should be kept. +There is little point forcing a purely cosmetic change upon tdb +users. + +2.12 <tdb_log_func-Doesnt-Take>tdb_log_func Doesn't Take The + Private Pointer + +For API compatibility reasons, the logging function needs to call +tdb_get_logging_private() to retrieve the pointer registered by +the tdb_open_ex for logging. + +2.12.1 Proposed Solution + +It should simply take an extra argument, since we are prepared to +break the API/ABI. + +2.13 Various Callback Functions Are Not Typesafe + +The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take] + is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read +and tdb_check all take void * and must internally convert it to +the argument type they were expecting. + +If this type changes, the compiler will not produce warnings on +the callers, since it only sees void *. + +2.13.1 Proposed Solution + +With careful use of macros, we can create callback functions +which give a warning when used on gcc and the types of the +callback and its private argument differ. Unsupported compilers +will not give a warning, which is no worse than now. In addition, +the callbacks become clearer, as they need not use void * for +their parameter. + +See CCAN's typesafe_cb module at +http://ccan.ozlabs.org/info/typesafe_cb.html + +2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, + tdb_reopen_all Problematic + +The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB +file should be cleared if the caller discovers it is the only +process with the TDB open. However, if any caller does not +specify TDB_CLEAR_IF_FIRST it will not be detected, so will have +the TDB erased underneath them (usually resulting in a crash). + +There is a similar issue on fork(); if the parent exits (or +otherwise closes the tdb) before the child calls tdb_reopen_all() +to establish the lock used to indicate the TDB is opened by +someone, a TDB_CLEAR_IF_FIRST opener at that moment will believe +it alone has opened the TDB and will erase it. + +2.14.1 Proposed Solution + +Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but +see [TDB_CLEAR_IF_FIRST-Imposes-Performance]. + +3 Performance And Scalability Issues + +3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST + Imposes Performance Penalty + +When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is +placed at offset 4 (aka. the ACTIVE_LOCK). While these locks +never conflict in normal tdb usage, they do add substantial +overhead for most fcntl lock implementations when the kernel +scans to detect if a lock conflict exists. This is often a single +linked list, making the time to acquire and release a fcntl lock +O(N) where N is the number of processes with the TDB open, not +the number actually doing work. + +In a Samba server it is common to have huge numbers of clients +sitting idle, and thus they have weaned themselves off the +TDB_CLEAR_IF_FIRST flag.[footnote: +There is a flag to tdb_reopen_all() which is used for this +optimization: if the parent process will outlive the child, the +child does not need the ACTIVE_LOCK. This is a workaround for +this very performance issue. +] + +3.1.1 Proposed Solution + +Remove the flag. It was a neat idea, but even trivial servers +tend to know when they are initializing for the first time and +can simply unlink the old tdb at that point. + +3.2 TDB Files Have a 4G Limit + +This seems to be becoming an issue (so much for “trivial”!), +particularly for ldb. + +3.2.1 Proposed Solution + +A new, incompatible TDB format which uses 64 bit offsets +internally rather than 32 bit as now. For simplicity of endian +conversion (which TDB does on the fly if required), all values +will be 64 bit on disk. In practice, some upper bits may be used +for other purposes, but at least 56 bits will be available for +file offsets. + +tdb_open() will automatically detect the old version, and even +create them if TDB_VERSION6 is specified to tdb_open. + +32 bit processes will still be able to access TDBs larger than 4G +(assuming that their off_t allows them to seek to 64 bits), they +will gracefully fall back as they fail to mmap. This can happen +already with large TDBs. + +Old versions of tdb will fail to open the new TDB files (since 28 +August 2009, commit 398d0c29290: prior to that any unrecognized +file format would be erased and initialized as a fresh tdb!) + +3.3 TDB Records Have a 4G Limit + +This has not been a reported problem, and the API uses size_t +which can be 64 bit on 64 bit platforms. However, other limits +may have made such an issue moot. + +3.3.1 Proposed Solution + +Record sizes will be 64 bit, with an error returned on 32 bit +platforms which try to access such records (the current +implementation would return TDB_ERR_OOM in a similar case). It +seems unlikely that 32 bit keys will be a limitation, so the +implementation may not support this (see [sub:Records-Incur-A]). + +3.4 Hash Size Is Determined At TDB Creation Time + +TDB contains a number of hash chains in the header; the number is +specified at creation time, and defaults to 131. This is such a +bottleneck on large databases (as each hash chain gets quite +long), that LDB uses 10,000 for this hash. In general it is +impossible to know what the 'right' answer is at database +creation time. + +3.4.1 Proposed Solution + +After comprehensive performance testing on various scalable hash +variants[footnote: +http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 +This was annoying because I was previously convinced that an +expanding tree of hashes would be very close to optimal. +], it became clear that it is hard to beat a straight linear hash +table which doubles in size when it reaches saturation. There are +three details which become important: + +1. On encountering a full bucket, we use the next bucket. + +2. Extra hash bits are stored with the offset, to reduce + comparisons. + +3. A marker entry is used on deleting an entry. + +The doubling of the table must be done under a transaction; we +will not reduce it on deletion, so it will be an unusual case. It +will either be placed at the head (other entries will be moved +out the way so we can expand). We could have a pointer in the +header to the current hashtable location, but that pointer would +have to be read frequently to check for hashtable moves. + +The locking for this is slightly more complex than the chained +case; we currently have one lock per bucket, and that means we +would need to expand the lock if we overflow to the next bucket. +The frequency of such collisions will effect our locking +heuristics: we can always lock more buckets than we need. + +One possible optimization is to only re-check the hash size on an +insert or a lookup miss. + +3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended + +TDB uses a single linked list for the free list. Allocation +occurs as follows, using heuristics which have evolved over time: + +1. Get the free list lock for this whole operation. + +2. Multiply length by 1.25, so we always over-allocate by 25%. + +3. Set the slack multiplier to 1. + +4. Examine the current freelist entry: if it is > length but < + the current best case, remember it as the best case. + +5. Multiply the slack multiplier by 1.05. + +6. If our best fit so far is less than length * slack multiplier, + return it. The slack will be turned into a new free record if + it's large enough. + +7. Otherwise, go onto the next freelist entry. + +Deleting a record occurs as follows: + +1. Lock the hash chain for this whole operation. + +2. Walk the chain to find the record, keeping the prev pointer + offset. + +3. If max_dead is non-zero: + + (a) Walk the hash chain again and count the dead records. + + (b) If it's more than max_dead, bulk free all the dead ones + (similar to steps 4 and below, but the lock is only obtained + once). + + (c) Simply mark this record as dead and return. + +4. Get the free list lock for the remainder of this operation. + +5. <right-merging>Examine the following block to see if it is + free; if so, enlarge the current block and remove that block + from the free list. This was disabled, as removal from the free + list was O(entries-in-free-list). + +6. Examine the preceeding block to see if it is free: for this + reason, each block has a 32-bit tailer which indicates its + length. If it is free, expand it to cover our new block and + return. + +7. Otherwise, prepend ourselves to the free list. + +Disabling right-merging (step [right-merging]) causes +fragmentation; the other heuristics proved insufficient to +address this, so the final answer to this was that when we expand +the TDB file inside a transaction commit, we repack the entire +tdb. + +The single list lock limits our allocation rate; due to the other +issues this is not currently seen as a bottleneck. + +3.5.1 Proposed Solution + +The first step is to remove all the current heuristics, as they +obviously interact, then examine them once the lock contention is +addressed. + +The free list must be split to reduce contention. Assuming +perfect free merging, we can at most have 1 free list entry for +each entry. This implies that the number of free lists is related +to the size of the hash table, but as it is rare to walk a large +number of free list entries we can use far fewer, say 1/32 of the +number of hash buckets. + +There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented] +) but it's not clear this would reduce contention in the common +case where all processes are allocating/freeing the same size. +Thus we almost certainly need to divide in other ways: the most +obvious is to divide the file into zones, and using a free list +(or set of free lists) for each. This approximates address +ordering. + +Note that this means we need to split the free lists when we +expand the file; this is probably acceptable when we double the +hash table size, since that is such an expensive operation +already. In the case of increasing the file size, there is an +optimization we can use: if we use M in the formula above as the +file size rounded up to the next power of 2, we only need +reshuffle free lists when the file size crosses a power of 2 +boundary, and reshuffling the free lists is trivial: we simply +merge every consecutive pair of free lists. + +The basic algorithm is as follows. Freeing is simple: + +1. Identify the correct zone. + +2. Lock the corresponding list. + +3. Re-check the zone (we didn't have a lock, sizes could have + changed): relock if necessary. + +4. Place the freed entry in the list for that zone. + +Allocation is a little more complicated, as we perform delayed +coalescing at this point: + +1. Pick a zone either the zone we last freed into, or based on a “ + random” number. + +2. Lock the corresponding list. + +3. Re-check the zone: relock if necessary. + +4. If the top entry is -large enough, remove it from the list and + return it. + +5. Otherwise, coalesce entries in the list. + + (a) + + (b) + + (c) + + (d) + +6. If there was no entry large enough, unlock the list and try + the next zone. + +7. + +8. + +9. If no zone satisfies, expand the file. + +This optimizes rapid insert/delete of free list entries by not +coalescing them all the time.. First-fit address ordering +ordering seems to be fairly good for keeping fragmentation low +(see [sub:TDB-Becomes-Fragmented]). Note that address ordering +does not need a tailer to coalesce, though if we needed one we +could have one cheaply: see [sub:Records-Incur-A]. + + + +I anticipate that the number of entries in each free zone would +be small, but it might be worth using one free entry to hold +pointers to the others for cache efficiency. + +3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented + +Much of this is a result of allocation strategy[footnote: +The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 +ftp://ftp.cs.utexas.edu/pub/garbage/malloc/ismm98.ps +] and deliberate hobbling of coalescing; internal fragmentation +(aka overallocation) is deliberately set at 25%, and external +fragmentation is only cured by the decision to repack the entire +db when a transaction commit needs to enlarge the file. + +3.6.1 Proposed Solution + +The 25% overhead on allocation works in practice for ldb because +indexes tend to expand by one record at a time. This internal +fragmentation can be resolved by having an “expanded” bit in the +header to note entries that have previously expanded, and +allocating more space for them. + +There are is a spectrum of possible solutions for external +fragmentation: one is to use a fragmentation-avoiding allocation +strategy such as best-fit address-order allocator. The other end +of the spectrum would be to use a bump allocator (very fast and +simple) and simply repack the file when we reach the end. + +There are three problems with efficient fragmentation-avoiding +allocators: they are non-trivial, they tend to use a single free +list for each size, and there's no evidence that tdb allocation +patterns will match those recorded for general allocators (though +it seems likely). + +Thus we don't spend too much effort on external fragmentation; we +will be no worse than the current code if we need to repack on +occasion. More effort is spent on reducing freelist contention, +and reducing overhead. + +3.7 <sub:Records-Incur-A>Records Incur A 28-Byte Overhead + +Each TDB record has a header as follows: + +struct tdb_record { + + tdb_off_t next; /* offset of the next record in the list +*/ + + tdb_len_t rec_len; /* total byte length of record */ + + tdb_len_t key_len; /* byte length of key */ + + tdb_len_t data_len; /* byte length of data */ + + uint32_t full_hash; /* the full 32 bit hash of the key */ + + uint32_t magic; /* try to catch errors */ + + /* the following union is implied: + + union { + + char record[rec_len]; + + struct { + + char key[key_len]; + + char data[data_len]; + + } + + uint32_t totalsize; (tailer) + + } + + */ + +}; + +Naively, this would double to a 56-byte overhead on a 64 bit +implementation. + +3.7.1 Proposed Solution + +We can use various techniques to reduce this for an allocated +block: + +1. The 'next' pointer is not required, as we are using a flat + hash table. + +2. 'rec_len' can instead be expressed as an addition to key_len + and data_len (it accounts for wasted or overallocated length in + the record). Since the record length is always a multiple of 8, + we can conveniently fit it in 32 bits (representing up to 35 + bits). + +3. 'key_len' and 'data_len' can be reduced. I'm unwilling to + restrict 'data_len' to 32 bits, but instead we can combine the + two into one 64-bit field and using a 5 bit value which + indicates at what bit to divide the two. Keys are unlikely to + scale as fast as data, so I'm assuming a maximum key size of 32 + bits. + +4. 'full_hash' is used to avoid a memcmp on the “miss” case, but + this is diminishing returns after a handful of bits (at 10 + bits, it reduces 99.9% of false memcmp). As an aside, as the + lower bits are already incorporated in the hash table + resolution, the upper bits should be used here. + +5. 'magic' does not need to be enlarged: it currently reflects + one of 5 values (used, free, dead, recovery, and + unused_recovery). It is useful for quick sanity checking + however, and should not be eliminated. + +6. 'tailer' is only used to coalesce free blocks (so a block to + the right can find the header to check if this block is free). + This can be replaced by a single 'free' bit in the header of + the following block (and the tailer only exists in free + blocks).[footnote: +This technique from Thomas Standish. Data Structure Techniques. +Addison-Wesley, Reading, Massachusetts, 1980. +] The current proposed coalescing algorithm doesn't need this, + however. + +This produces a 16 byte used header like this: + +struct tdb_used_record { + + uint32_t magic : 16, + + prev_is_free: 1, + + key_data_divide: 5, + + top_hash: 10; + + uint32_t extra_octets; + + uint64_t key_and_data_len; + +}; + +And a free record like this: + +struct tdb_free_record { + + uint32_t free_magic; + + uint64_t total_length; + + ... + + uint64_t tailer; + +}; + + + +3.8 Transaction Commit Requires 4 fdatasync + +The current transaction algorithm is: + +1. write_recovery_data(); + +2. sync(); + +3. write_recovery_header(); + +4. sync(); + +5. overwrite_with_new_data(); + +6. sync(); + +7. remove_recovery_header(); + +8. sync(); + +On current ext3, each sync flushes all data to disk, so the next +3 syncs are relatively expensive. But this could become a +performance bottleneck on other filesystems such as ext4. + +3.8.1 Proposed Solution + + + + + + + + + +Neil Brown points out that this is overzealous, and only one sync +is needed: + +1. Bundle the recovery data, a transaction counter and a strong + checksum of the new data. + +2. Strong checksum that whole bundle. + +3. Store the bundle in the database. + +4. Overwrite the oldest of the two recovery pointers in the + header (identified using the transaction counter) with the + offset of this bundle. + +5. sync. + +6. Write the new data to the file. + +Checking for recovery means identifying the latest bundle with a +valid checksum and using the new data checksum to ensure that it +has been applied. This is more expensive than the current check, +but need only be done at open. For running databases, a separate +header field can be used to indicate a transaction in progress; +we need only check for recovery if this is set. + +3.9 TDB Does Not Have Snapshot Support + +3.9.1 Proposed Solution + +None. At some point you say “use a real database”. + +But as a thought experiment, if we implemented transactions to +only overwrite free entries (this is tricky: there must not be a +header in each entry which indicates whether it is free, but use +of presence in metadata elsewhere), and a pointer to the hash +table, we could create an entirely new commit without destroying +existing data. Then it would be easy to implement snapshots in a +similar way. + +This would not allow arbitrary changes to the database, such as +tdb_repack does, and would require more space (since we have to +preserve the current and future entries at once). If we used hash +trees rather than one big hash table, we might only have to +rewrite some sections of the hash, too. + +We could then implement snapshots using a similar method, using +multiple different hash tables/free tables. + +3.10 Transactions Cannot Operate in Parallel + +This would be useless for ldb, as it hits the index records with +just about every update. It would add significant complexity in +resolving clashes, and cause the all transaction callers to write +their code to loop in the case where the transactions spuriously +failed. + +3.10.1 Proposed Solution + +We could solve a small part of the problem by providing read-only +transactions. These would allow one write transaction to begin, +but it could not commit until all r/o transactions are done. This +would require a new RO_TRANSACTION_LOCK, which would be upgraded +on commit. + +3.11 Default Hash Function Is Suboptimal + +The Knuth-inspired multiplicative hash used by tdb is fairly slow +(especially if we expand it to 64 bits), and works best when the +hash bucket size is a prime number (which also means a slow +modulus). In addition, it is highly predictable which could +potentially lead to a Denial of Service attack in some TDB uses. + +3.11.1 Proposed Solution + +The Jenkins lookup3 hash[footnote: +http://burtleburtle.net/bob/c/lookup3.c +] is a fast and superbly-mixing hash. It's used by the Linux +kernel and almost everything else. This has the particular +properties that it takes an initial seed, and produces two 32 bit +hash numbers, which we can combine into a 64-bit hash. + +The seed should be created at tdb-creation time from some random +source, and placed in the header. This is far from foolproof, but +adds a little bit of protection against hash bombing. + +3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity + +We lock a record during traversal iteration, and try to grab that +lock in the delete code. If that grab on delete fails, we simply +mark it deleted and continue onwards; traversal checks for this +condition and does the delete when it moves off the record. + +If traversal terminates, the dead record may be left +indefinitely. + +3.12.1 Proposed Solution + +Remove reliability guarantees; see [traverse-Proposed-Solution]. + +3.13 Fcntl Locking Adds Overhead + +Placing a fcntl lock means a system call, as does removing one. +This is actually one reason why transactions can be faster +(everything is locked once at transaction start). In the +uncontended case, this overhead can theoretically be eliminated. + +3.13.1 Proposed Solution + +None. + +We tried this before with spinlock support, in the early days of +TDB, and it didn't make much difference except in manufactured +benchmarks. + +We could use spinlocks (with futex kernel support under Linux), +but it means that we lose automatic cleanup when a process dies +with a lock. There is a method of auto-cleanup under Linux, but +it's not supported by other operating systems. We could +reintroduce a clear-if-first-style lock and sweep for dead +futexes on open, but that wouldn't help the normal case of one +concurrent opener dying. Increasingly elaborate repair schemes +could be considered, but they require an ABI change (everyone +must use them) anyway, so there's no need to do this at the same +time as everything else. diff --git a/lib/tdb2/doc/design.lyx b/lib/tdb2/doc/design.lyx new file mode 100644 index 0000000000..0a1d6a14bc --- /dev/null +++ b/lib/tdb2/doc/design.lyx @@ -0,0 +1,2689 @@ +#LyX 1.6.7 created this file. For more info see http://www.lyx.org/ +\lyxformat 345 +\begin_document +\begin_header +\textclass article +\use_default_options true +\language english +\inputencoding auto +\font_roman default +\font_sans default +\font_typewriter default +\font_default_family default +\font_sc false +\font_osf false +\font_sf_scale 100 +\font_tt_scale 100 + +\graphics default +\paperfontsize default +\use_hyperref false +\papersize default +\use_geometry false +\use_amsmath 1 +\use_esint 1 +\cite_engine basic +\use_bibtopic false +\paperorientation portrait +\secnumdepth 3 +\tocdepth 3 +\paragraph_separation indent +\defskip medskip +\quotes_language english +\papercolumns 1 +\papersides 1 +\paperpagestyle default +\tracking_changes true +\output_changes true +\author "" +\author "" +\end_header + +\begin_body + +\begin_layout Title +TDB2: A Redesigning The Trivial DataBase +\end_layout + +\begin_layout Author +Rusty Russell, IBM Corporation +\end_layout + +\begin_layout Date +17-March-2011 +\end_layout + +\begin_layout Abstract +The Trivial DataBase on-disk format is 32 bits; with usage cases heading + towards the 4G limit, that must change. + This required breakage provides an opportunity to revisit TDB's other design + decisions and reassess them. +\end_layout + +\begin_layout Section +Introduction +\end_layout + +\begin_layout Standard +The Trivial DataBase was originally written by Andrew Tridgell as a simple + key/data pair storage system with the same API as dbm, but allowing multiple + readers and writers while being small enough (< 1000 lines of C) to include + in SAMBA. + The simple design created in 1999 has proven surprisingly robust and performant +, used in Samba versions 3 and 4 as well as numerous other projects. + Its useful life was greatly increased by the (backwards-compatible!) addition + of transaction support in 2005. +\end_layout + +\begin_layout Standard +The wider variety and greater demands of TDB-using code has lead to some + organic growth of the API, as well as some compromises on the implementation. + None of these, by themselves, are seen as show-stoppers, but the cumulative + effect is to a loss of elegance over the initial, simple TDB implementation. + Here is a table of the approximate number of lines of implementation code + and number of API functions at the end of each year: +\end_layout + +\begin_layout Standard +\begin_inset Tabular +<lyxtabular version="3" rows="12" columns="3"> +<features> +<column alignment="center" valignment="top" width="0"> +<column alignment="center" valignment="top" width="0"> +<column alignment="center" valignment="top" width="0"> +<row> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +Year End +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +API Functions +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +Lines of C Code Implementation +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1999 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +13 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1195 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2000 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +24 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1725 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2001 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +32 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2228 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2002 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +35 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2481 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2003 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +35 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2552 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2004 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +40 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2584 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2005 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +38 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2647 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2006 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +52 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +3754 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2007 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +66 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +4398 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2008 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +71 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +4768 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2009 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +73 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +5715 +\end_layout + +\end_inset +</cell> +</row> +</lyxtabular> + +\end_inset + + +\end_layout + +\begin_layout Standard +This review is an attempt to catalog and address all the known issues with + TDB and create solutions which address the problems without significantly + increasing complexity; all involved are far too aware of the dangers of + second system syndrome in rewriting a successful project like this. +\end_layout + +\begin_layout Section +API Issues +\end_layout + +\begin_layout Subsection +tdb_open_ex Is Not Expandable +\end_layout + +\begin_layout Standard +The tdb_open() call was expanded to tdb_open_ex(), which added an optional + hashing function and an optional logging function argument. + Additional arguments to open would require the introduction of a tdb_open_ex2 + call etc. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "attributes" + +\end_inset + + +\end_layout + +\begin_layout Standard +tdb_open() will take a linked-list of attributes: +\end_layout + +\begin_layout LyX-Code +enum tdb_attribute { +\end_layout + +\begin_layout LyX-Code + TDB_ATTRIBUTE_LOG = 0, +\end_layout + +\begin_layout LyX-Code + TDB_ATTRIBUTE_HASH = 1 +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_base { +\end_layout + +\begin_layout LyX-Code + enum tdb_attribute attr; +\end_layout + +\begin_layout LyX-Code + union tdb_attribute *next; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_log { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */ +\end_layout + +\begin_layout LyX-Code + tdb_log_func log_fn; +\end_layout + +\begin_layout LyX-Code + void *log_private; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_hash { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */ +\end_layout + +\begin_layout LyX-Code + tdb_hash_func hash_fn; +\end_layout + +\begin_layout LyX-Code + void *hash_private; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +union tdb_attribute { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_log log; +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_hash hash; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +This allows future attributes to be added, even if this expands the size + of the union. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +tdb_traverse Makes Impossible Guarantees +\end_layout + +\begin_layout Standard +tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it + was thought that it was important to guarantee that all records which exist + at the start and end of the traversal would be included, and no record + would be included twice. +\end_layout + +\begin_layout Standard +This adds complexity (see +\begin_inset CommandInset ref +LatexCommand ref +reference "Reliable-Traversal-Adds" + +\end_inset + +) and does not work anyway for records which are altered (in particular, + those which are expanded may be effectively deleted and re-added behind + the traversal). +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "traverse-Proposed-Solution" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +Abandon the guarantee. + You will see every record if no changes occur during your traversal, otherwise + you will see some subset. + You can prevent changes by using a transaction or the locking API. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. + Delete-during-traverse will still delete every record, too (assuming no + other changes). +\end_layout + +\begin_layout Subsection +Nesting of Transactions Is Fraught +\end_layout + +\begin_layout Standard +TDB has alternated between allowing nested transactions and not allowing + them. + Various paths in the Samba codebase assume that transactions will nest, + and in a sense they can: the operation is only committed to disk when the + outer transaction is committed. + There are two problems, however: +\end_layout + +\begin_layout Enumerate +Canceling the inner transaction will cause the outer transaction commit + to fail, and will not undo any operations since the inner transaction began. + This problem is soluble with some additional internal code. +\end_layout + +\begin_layout Enumerate +An inner transaction commit can be cancelled by the outer transaction. + This is desirable in the way which Samba's database initialization code + uses transactions, but could be a surprise to any users expecting a successful + transaction commit to expose changes to others. +\end_layout + +\begin_layout Standard +The current solution is to specify the behavior at tdb_open(), with the + default currently that nested transactions are allowed. + This flag can also be changed at runtime. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Given the usage patterns, it seems that the +\begin_inset Quotes eld +\end_inset + +least-surprise +\begin_inset Quotes erd +\end_inset + + behavior of disallowing nested transactions should become the default. + Additionally, it seems the outer transaction is the only code which knows + whether inner transactions should be allowed, so a flag to indicate this + could be added to tdb_transaction_start. + However, this behavior can be simulated with a wrapper which uses tdb_add_flags +() and tdb_remove_flags(), so the API should not be expanded for this relatively +-obscure case. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete; the nesting flag has been removed. +\end_layout + +\begin_layout Subsection +Incorrect Hash Function is Not Detected +\end_layout + +\begin_layout Standard +tdb_open_ex() allows the calling code to specify a different hash function + to use, but does not check that all other processes accessing this tdb + are using the same hash function. + The result is that records are missing from tdb_fetch(). +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The header should contain an example hash result (eg. + the hash of 0xdeadbeef), and tdb_open_ex() should check that the given + hash function produces the same answer, or fail the tdb_open call. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +tdb_set_max_dead/TDB_VOLATILE Expose Implementation +\end_layout + +\begin_layout Standard +In response to scalability issues with the free list ( +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB-Freelist-Is" + +\end_inset + +) two API workarounds have been incorporated in TDB: tdb_set_max_dead() + and the TDB_VOLATILE flag to tdb_open. + The latter actually calls the former with an argument of +\begin_inset Quotes eld +\end_inset + +5 +\begin_inset Quotes erd +\end_inset + +. +\end_layout + +\begin_layout Standard +This code allows deleted records to accumulate without putting them in the + free list. + On delete we iterate through each chain and free them in a batch if there + are more than max_dead entries. + These are never otherwise recycled except as a side-effect of a tdb_repack. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +With the scalability problems of the freelist solved, this API can be removed. + The TDB_VOLATILE flag may still be useful as a hint that store and delete + of records will be at least as common as fetch in order to allow some internal + tuning, but initially will become a no-op. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. + Unknown flags cause tdb_open() to fail as well, so they can be detected + at runtime. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB-Files-Cannot" + +\end_inset + +TDB Files Cannot Be Opened Multiple Times In The Same Process +\end_layout + +\begin_layout Standard +No process can open the same TDB twice; we check and disallow it. + This is an unfortunate side-effect of fcntl locks, which operate on a per-file + rather than per-file-descriptor basis, and do not nest. + Thus, closing any file descriptor on a file clears all the locks obtained + by this process, even if they were placed using a different file descriptor! +\end_layout + +\begin_layout Standard +Note that even if this were solved, deadlock could occur if operations were + nested: this is a more manageable programming error in most cases. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We could lobby POSIX to fix the perverse rules, or at least lobby Linux + to violate them so that the most common implementation does not have this + restriction. + This would be a generally good idea for other fcntl lock users. +\end_layout + +\begin_layout Standard +Samba uses a wrapper which hands out the same tdb_context to multiple callers + if this happens, and does simple reference counting. + We should do this inside the tdb library, which already emulates lock nesting + internally; it would need to recognize when deadlock occurs within a single + process. + This would create a new failure mode for tdb operations (while we currently + handle locking failures, they are impossible in normal use and a process + encountering them can do little but give up). +\end_layout + +\begin_layout Standard +I do not see benefit in an additional tdb_open flag to indicate whether + re-opening is allowed, as though there may be some benefit to adding a + call to detect when a tdb_context is shared, to allow other to create such + an API. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB API Is Not POSIX Thread-safe +\end_layout + +\begin_layout Standard +The TDB API uses an error code which can be queried after an operation to + determine what went wrong. + This programming model does not work with threads, unless specific additional + guarantees are given by the implementation. + In addition, even otherwise-independent threads cannot open the same TDB + (as in +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB-Files-Cannot" + +\end_inset + +). +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Reachitecting the API to include a tdb_errcode pointer would be a great + deal of churn, but fortunately most functions return 0 on success and -1 + on error: we can change these to return 0 on success and a negative error + code on error, and the API remains similar to previous. + The tdb_fetch, tdb_firstkey and tdb_nextkey functions need to take a TDB_DATA + pointer and return an error code. + It is also simpler to have tdb_nextkey replace its key argument in place, + freeing up any old .dptr. +\end_layout + +\begin_layout Standard +Internal locking is required to make sure that fcntl locks do not overlap + between threads, and also that the global list of tdbs is maintained. +\end_layout + +\begin_layout Standard +The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe + version of the library, and otherwise no overhead will exist. + Alternatively, a hooking mechanism similar to that proposed for +\begin_inset CommandInset ref +LatexCommand ref +reference "Proposed-Solution-locking-hook" + +\end_inset + + could be used to enable pthread locking at runtime. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete; API has been changed but thread safety has not been implemented. +\end_layout + +\begin_layout Subsection +*_nonblock Functions And *_mark Functions Expose Implementation +\end_layout + +\begin_layout Standard +CTDB +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +Clustered TDB, see http://ctdb.samba.org +\end_layout + +\end_inset + + wishes to operate on TDB in a non-blocking manner. + This is currently done as follows: +\end_layout + +\begin_layout Enumerate +Call the _nonblock variant of an API function (eg. + tdb_lockall_nonblock). + If this fails: +\end_layout + +\begin_layout Enumerate +Fork a child process, and wait for it to call the normal variant (eg. + tdb_lockall). +\end_layout + +\begin_layout Enumerate +If the child succeeds, call the _mark variant to indicate we already have + the locks (eg. + tdb_lockall_mark). +\end_layout + +\begin_layout Enumerate +Upon completion, tell the child to release the locks (eg. + tdb_unlockall). +\end_layout + +\begin_layout Enumerate +Indicate to tdb that it should consider the locks removed (eg. + tdb_unlockall_mark). +\end_layout + +\begin_layout Standard +There are several issues with this approach. + Firstly, adding two new variants of each function clutters the API for + an obscure use, and so not all functions have three variants. + Secondly, it assumes that all paths of the functions ask for the same locks, + otherwise the parent process will have to get a lock which the child doesn't + have under some circumstances. + I don't believe this is currently the case, but it constrains the implementatio +n. + +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "Proposed-Solution-locking-hook" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +Implement a hook for locking methods, so that the caller can control the + calls to create and remove fcntl locks. + In this scenario, ctdbd would operate as follows: +\end_layout + +\begin_layout Enumerate +Call the normal API function, eg tdb_lockall(). +\end_layout + +\begin_layout Enumerate +When the lock callback comes in, check if the child has the lock. + Initially, this is always false. + If so, return 0. + Otherwise, try to obtain it in non-blocking mode. + If that fails, return EWOULDBLOCK. +\end_layout + +\begin_layout Enumerate +Release locks in the unlock callback as normal. +\end_layout + +\begin_layout Enumerate +If tdb_lockall() fails, see if we recorded a lock failure; if so, call the + child to repeat the operation. +\end_layout + +\begin_layout Enumerate +The child records what locks it obtains, and returns that information to + the parent. +\end_layout + +\begin_layout Enumerate +When the child has succeeded, goto 1. +\end_layout + +\begin_layout Standard +This is flexible enough to handle any potential locking scenario, even when + lock requirements change. + It can be optimized so that the parent does not release locks, just tells + the child which locks it doesn't need to obtain. +\end_layout + +\begin_layout Standard +It also keeps the complexity out of the API, and in ctdbd where it is needed. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +tdb_chainlock Functions Expose Implementation +\end_layout + +\begin_layout Standard +tdb_chainlock locks some number of records, including the record indicated + by the given key. + This gave atomicity guarantees; no-one can start a transaction, alter, + read or delete that key while the lock is held. +\end_layout + +\begin_layout Standard +It also makes the same guarantee for any other key in the chain, which is + an internal implementation detail and potentially a cause for deadlock. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. + It would be nice to have an explicit single entry lock which effected no + other keys. + Unfortunately, this won't work for an entry which doesn't exist. + Thus while chainlock may be implemented more efficiently for the existing + case, it will still have overlap issues with the non-existing case. + So it is best to keep the current (lack of) guarantee about which records + will be effected to avoid constraining our implementation. +\end_layout + +\begin_layout Subsection +Signal Handling is Not Race-Free +\end_layout + +\begin_layout Standard +The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate + that the tdb locking code should return with a failure, rather than trying + again when a signal is received (and errno == EAGAIN). + This is usually used to implement timeouts. +\end_layout + +\begin_layout Standard +Unfortunately, this does not work in the case where the signal is received + before the tdb code enters the fcntl() call to place the lock: the code + will sleep within the fcntl() code, unaware that the signal wants it to + exit. + In the case of long timeouts, this does not happen in practice. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The locking hooks proposed in +\begin_inset CommandInset ref +LatexCommand ref +reference "Proposed-Solution-locking-hook" + +\end_inset + + would allow the user to decide on whether to fail the lock acquisition + on a signal. + This allows the caller to choose their own compromise: they could narrow + the race by checking immediately before the fcntl call. +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +It may be possible to make this race-free in some implementations by having + the signal handler alter the struct flock to make it invalid. + This will cause the fcntl() lock call to fail with EINVAL if the signal + occurs before the kernel is entered, otherwise EAGAIN. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +The API Uses Gratuitous Typedefs, Capitals +\end_layout + +\begin_layout Standard +typedefs are useful for providing source compatibility when types can differ + across implementations, or arguably in the case of function pointer definitions + which are hard for humans to parse. + Otherwise it is simply obfuscation and pollutes the namespace. +\end_layout + +\begin_layout Standard +Capitalization is usually reserved for compile-time constants and macros. +\end_layout + +\begin_layout Description +TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the + definition isn't visible to the API user anyway. +\end_layout + +\begin_layout Description +TDB_DATA There is no reason to use this over struct TDB_DATA; the struct + needs to be understood by the API user. +\end_layout + +\begin_layout Description +struct +\begin_inset space ~ +\end_inset + +TDB_DATA This would normally be called 'struct tdb_data'. +\end_layout + +\begin_layout Description +enum +\begin_inset space ~ +\end_inset + +TDB_ERROR Similarly, this would normally be enum tdb_error. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. + Introducing lower case variants would please pedants like myself, but if + it were done the existing ones should be kept. + There is little point forcing a purely cosmetic change upon tdb users. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "tdb_log_func-Doesnt-Take" + +\end_inset + +tdb_log_func Doesn't Take The Private Pointer +\end_layout + +\begin_layout Standard +For API compatibility reasons, the logging function needs to call tdb_get_loggin +g_private() to retrieve the pointer registered by the tdb_open_ex for logging. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +It should simply take an extra argument, since we are prepared to break + the API/ABI. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Various Callback Functions Are Not Typesafe +\end_layout + +\begin_layout Standard +The callback functions in tdb_set_logging_function (after +\begin_inset CommandInset ref +LatexCommand ref +reference "tdb_log_func-Doesnt-Take" + +\end_inset + + is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check + all take void * and must internally convert it to the argument type they + were expecting. +\end_layout + +\begin_layout Standard +If this type changes, the compiler will not produce warnings on the callers, + since it only sees void *. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +With careful use of macros, we can create callback functions which give + a warning when used on gcc and the types of the callback and its private + argument differ. + Unsupported compilers will not give a warning, which is no worse than now. + In addition, the callbacks become clearer, as they need not use void * + for their parameter. +\end_layout + +\begin_layout Standard +See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic +\end_layout + +\begin_layout Standard +The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should + be cleared if the caller discovers it is the only process with the TDB + open. + However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not + be detected, so will have the TDB erased underneath them (usually resulting + in a crash). +\end_layout + +\begin_layout Standard +There is a similar issue on fork(); if the parent exits (or otherwise closes + the tdb) before the child calls tdb_reopen_all() to establish the lock + used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener + at that moment will believe it alone has opened the TDB and will erase + it. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove TDB_CLEAR_IF_FIRST. + Other workarounds are possible, but see +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB_CLEAR_IF_FIRST-Imposes-Performance" + +\end_inset + +. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Extending The Header Is Difficult +\end_layout + +\begin_layout Standard +We have reserved (zeroed) words in the TDB header, which can be used for + future features. + If the future features are compulsory, the version number must be updated + to prevent old code from accessing the database. + But if the future feature is optional, we have no way of telling if older + code is accessing the database or not. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The header should contain a +\begin_inset Quotes eld +\end_inset + +format variant +\begin_inset Quotes erd +\end_inset + + value (64-bit). + This is divided into two 32-bit parts: +\end_layout + +\begin_layout Enumerate +The lower part reflects the format variant understood by code accessing + the database. +\end_layout + +\begin_layout Enumerate +The upper part reflects the format variant you must understand to write + to the database (otherwise you can only open for reading). +\end_layout + +\begin_layout Standard +The latter field can only be written at creation time, the former should + be written under the OPEN_LOCK when opening the database for writing, if + the variant of the code is lower than the current lowest variant. +\end_layout + +\begin_layout Standard +This should allow backwards-compatible features to be added, and detection + if older code (which doesn't understand the feature) writes to the database. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Record Headers Are Not Expandible +\end_layout + +\begin_layout Standard +If we later want to add (say) checksums on keys and data, it would require + another format change, which we'd like to avoid. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We often have extra padding at the tail of a record. + If we ensure that the first byte (if any) of this padding is zero, we will + have a way for future changes to detect code which doesn't understand a + new format: the new code would write (say) a 1 at the tail, and thus if + there is no tail or the first byte is 0, we would know the extension is + not present on that record. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB Does Not Use Talloc +\end_layout + +\begin_layout Standard +Many users of TDB (particularly Samba) use the talloc allocator, and thus + have to wrap TDB in a talloc context to use it conveniently. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The allocation within TDB is not complicated enough to justify the use of + talloc, and I am reluctant to force another (excellent) library on TDB + users. + Nonetheless a compromise is possible. + An attribute (see +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) can be added later to tdb_open() to provide an alternate allocation mechanism, + specifically for talloc but usable by any other allocator (which would + ignore the +\begin_inset Quotes eld +\end_inset + +context +\begin_inset Quotes erd +\end_inset + + argument). +\end_layout + +\begin_layout Standard +This would form a talloc heirarchy as expected, but the caller would still + have to attach a destructor to the tdb context returned from tdb_open to + close it. + All TDB_DATA fields would be children of the tdb_context, and the caller + would still have to manage them (using talloc_free() or talloc_steal()). +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Section +Performance And Scalability Issues +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB_CLEAR_IF_FIRST-Imposes-Performance" + +\end_inset + +TDB_CLEAR_IF_FIRST Imposes Performance Penalty +\end_layout + +\begin_layout Standard +When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset + 4 (aka. + the ACTIVE_LOCK). + While these locks never conflict in normal tdb usage, they do add substantial + overhead for most fcntl lock implementations when the kernel scans to detect + if a lock conflict exists. + This is often a single linked list, making the time to acquire and release + a fcntl lock O(N) where N is the number of processes with the TDB open, + not the number actually doing work. +\end_layout + +\begin_layout Standard +In a Samba server it is common to have huge numbers of clients sitting idle, + and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag. +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +There is a flag to tdb_reopen_all() which is used for this optimization: + if the parent process will outlive the child, the child does not need the + ACTIVE_LOCK. + This is a workaround for this very performance issue. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove the flag. + It was a neat idea, but even trivial servers tend to know when they are + initializing for the first time and can simply unlink the old tdb at that + point. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB Files Have a 4G Limit +\end_layout + +\begin_layout Standard +This seems to be becoming an issue (so much for +\begin_inset Quotes eld +\end_inset + +trivial +\begin_inset Quotes erd +\end_inset + +!), particularly for ldb. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +A new, incompatible TDB format which uses 64 bit offsets internally rather + than 32 bit as now. + For simplicity of endian conversion (which TDB does on the fly if required), + all values will be 64 bit on disk. + In practice, some upper bits may be used for other purposes, but at least + 56 bits will be available for file offsets. +\end_layout + +\begin_layout Standard +tdb_open() will automatically detect the old version, and even create them + if TDB_VERSION6 is specified to tdb_open. +\end_layout + +\begin_layout Standard +32 bit processes will still be able to access TDBs larger than 4G (assuming + that their off_t allows them to seek to 64 bits), they will gracefully + fall back as they fail to mmap. + This can happen already with large TDBs. +\end_layout + +\begin_layout Standard +Old versions of tdb will fail to open the new TDB files (since 28 August + 2009, commit 398d0c29290: prior to that any unrecognized file format would + be erased and initialized as a fresh tdb!) +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB Records Have a 4G Limit +\end_layout + +\begin_layout Standard +This has not been a reported problem, and the API uses size_t which can + be 64 bit on 64 bit platforms. + However, other limits may have made such an issue moot. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Record sizes will be 64 bit, with an error returned on 32 bit platforms + which try to access such records (the current implementation would return + TDB_ERR_OOM in a similar case). + It seems unlikely that 32 bit keys will be a limitation, so the implementation + may not support this (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Records-Incur-A" + +\end_inset + +). +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Hash Size Is Determined At TDB Creation Time +\end_layout + +\begin_layout Standard +TDB contains a number of hash chains in the header; the number is specified + at creation time, and defaults to 131. + This is such a bottleneck on large databases (as each hash chain gets quite + long), that LDB uses 10,000 for this hash. + In general it is impossible to know what the 'right' answer is at database + creation time. +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "sub:Hash-Size-Solution" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +After comprehensive performance testing on various scalable hash variants +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying + because I was previously convinced that an expanding tree of hashes would + be very close to optimal. +\end_layout + +\end_inset + +, it became clear that it is hard to beat a straight linear hash table which + doubles in size when it reaches saturation. + Unfortunately, altering the hash table introduces serious locking complications +: the entire hash table needs to be locked to enlarge the hash table, and + others might be holding locks. + Particularly insidious are insertions done under tdb_chainlock. +\end_layout + +\begin_layout Standard +Thus an expanding layered hash will be used: an array of hash groups, with + each hash group exploding into pointers to lower hash groups once it fills, + turning into a hash tree. + This has implications for locking: we must lock the entire group in case + we need to expand it, yet we don't know how deep the tree is at that point. +\end_layout + +\begin_layout Standard +Note that bits from the hash table entries should be stolen to hold more + hash bits to reduce the penalty of collisions. + We can use the otherwise-unused lower 3 bits. + If we limit the size of the database to 64 exabytes, we can use the top + 8 bits of the hash entry as well. + These 11 bits would reduce false positives down to 1 in 2000 which is more + than we need: we can use one of the bits to indicate that the extra hash + bits are valid. + This means we can choose not to re-hash all entries when we expand a hash + group; simply use the next bits we need and mark them invalid. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB-Freelist-Is" + +\end_inset + +TDB Freelist Is Highly Contended +\end_layout + +\begin_layout Standard +TDB uses a single linked list for the free list. + Allocation occurs as follows, using heuristics which have evolved over + time: +\end_layout + +\begin_layout Enumerate +Get the free list lock for this whole operation. +\end_layout + +\begin_layout Enumerate +Multiply length by 1.25, so we always over-allocate by 25%. +\end_layout + +\begin_layout Enumerate +Set the slack multiplier to 1. +\end_layout + +\begin_layout Enumerate +Examine the current freelist entry: if it is > length but < the current + best case, remember it as the best case. +\end_layout + +\begin_layout Enumerate +Multiply the slack multiplier by 1.05. +\end_layout + +\begin_layout Enumerate +If our best fit so far is less than length * slack multiplier, return it. + The slack will be turned into a new free record if it's large enough. +\end_layout + +\begin_layout Enumerate +Otherwise, go onto the next freelist entry. +\end_layout + +\begin_layout Standard +Deleting a record occurs as follows: +\end_layout + +\begin_layout Enumerate +Lock the hash chain for this whole operation. +\end_layout + +\begin_layout Enumerate +Walk the chain to find the record, keeping the prev pointer offset. +\end_layout + +\begin_layout Enumerate +If max_dead is non-zero: +\end_layout + +\begin_deeper +\begin_layout Enumerate +Walk the hash chain again and count the dead records. +\end_layout + +\begin_layout Enumerate +If it's more than max_dead, bulk free all the dead ones (similar to steps + 4 and below, but the lock is only obtained once). +\end_layout + +\begin_layout Enumerate +Simply mark this record as dead and return. + +\end_layout + +\end_deeper +\begin_layout Enumerate +Get the free list lock for the remainder of this operation. +\end_layout + +\begin_layout Enumerate +\begin_inset CommandInset label +LatexCommand label +name "right-merging" + +\end_inset + +Examine the following block to see if it is free; if so, enlarge the current + block and remove that block from the free list. + This was disabled, as removal from the free list was O(entries-in-free-list). +\end_layout + +\begin_layout Enumerate +Examine the preceeding block to see if it is free: for this reason, each + block has a 32-bit tailer which indicates its length. + If it is free, expand it to cover our new block and return. +\end_layout + +\begin_layout Enumerate +Otherwise, prepend ourselves to the free list. +\end_layout + +\begin_layout Standard +Disabling right-merging (step +\begin_inset CommandInset ref +LatexCommand ref +reference "right-merging" + +\end_inset + +) causes fragmentation; the other heuristics proved insufficient to address + this, so the final answer to this was that when we expand the TDB file + inside a transaction commit, we repack the entire tdb. +\end_layout + +\begin_layout Standard +The single list lock limits our allocation rate; due to the other issues + this is not currently seen as a bottleneck. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The first step is to remove all the current heuristics, as they obviously + interact, then examine them once the lock contention is addressed. +\end_layout + +\begin_layout Standard +The free list must be split to reduce contention. + Assuming perfect free merging, we can at most have 1 free list entry for + each entry. + This implies that the number of free lists is related to the size of the + hash table, but as it is rare to walk a large number of free list entries + we can use far fewer, say 1/32 of the number of hash buckets. +\end_layout + +\begin_layout Standard +It seems tempting to try to reuse the hash implementation which we use for + records here, but we have two ways of searching for free entries: for allocatio +n we search by size (and possibly zone) which produces too many clashes + for our hash table to handle well, and for coalescing we search by address. + Thus an array of doubly-linked free lists seems preferable. +\end_layout + +\begin_layout Standard +There are various benefits in using per-size free lists (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Becomes-Fragmented" + +\end_inset + +) but it's not clear this would reduce contention in the common case where + all processes are allocating/freeing the same size. + Thus we almost certainly need to divide in other ways: the most obvious + is to divide the file into zones, and using a free list (or table of free + lists) for each. + This approximates address ordering. +\end_layout + +\begin_layout Standard +Unfortunately it is difficult to know what heuristics should be used to + determine zone sizes, and our transaction code relies on being able to + create a +\begin_inset Quotes eld +\end_inset + +recovery area +\begin_inset Quotes erd +\end_inset + + by simply appending to the file (difficult if it would need to create a + new zone header). + Thus we use a linked-list of free tables; currently we only ever create + one, but if there is more than one we choose one at random to use. + In future we may use heuristics to add new free tables on contention. + We only expand the file when all free tables are exhausted. +\end_layout + +\begin_layout Standard +The basic algorithm is as follows. + Freeing is simple: +\end_layout + +\begin_layout Enumerate +Identify the correct free list. +\end_layout + +\begin_layout Enumerate +Lock the corresponding list. +\end_layout + +\begin_layout Enumerate +Re-check the list (we didn't have a lock, sizes could have changed): relock + if necessary. +\end_layout + +\begin_layout Enumerate +Place the freed entry in the list. +\end_layout + +\begin_layout Standard +Allocation is a little more complicated, as we perform delayed coalescing + at this point: +\end_layout + +\begin_layout Enumerate +Pick a free table; usually the previous one. +\end_layout + +\begin_layout Enumerate +Lock the corresponding list. +\end_layout + +\begin_layout Enumerate +If the top entry is -large enough, remove it from the list and return it. +\end_layout + +\begin_layout Enumerate +Otherwise, coalesce entries in the list.If there was no entry large enough, + unlock the list and try the next largest list +\end_layout + +\begin_layout Enumerate +If no list has an entry which meets our needs, try the next free table. +\end_layout + +\begin_layout Enumerate +If no zone satisfies, expand the file. +\end_layout + +\begin_layout Standard +This optimizes rapid insert/delete of free list entries by not coalescing + them all the time.. + First-fit address ordering ordering seems to be fairly good for keeping + fragmentation low (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Becomes-Fragmented" + +\end_inset + +). + Note that address ordering does not need a tailer to coalesce, though if + we needed one we could have one cheaply: see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Records-Incur-A" + +\end_inset + +. + +\end_layout + +\begin_layout Standard +Each free entry has the free table number in the header: less than 255. + It also contains a doubly-linked list for easy deletion. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:TDB-Becomes-Fragmented" + +\end_inset + +TDB Becomes Fragmented +\end_layout + +\begin_layout Standard +Much of this is a result of allocation strategy +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute +xas.edu/pub/garbage/malloc/ismm98.ps +\end_layout + +\end_inset + + and deliberate hobbling of coalescing; internal fragmentation (aka overallocati +on) is deliberately set at 25%, and external fragmentation is only cured + by the decision to repack the entire db when a transaction commit needs + to enlarge the file. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The 25% overhead on allocation works in practice for ldb because indexes + tend to expand by one record at a time. + This internal fragmentation can be resolved by having an +\begin_inset Quotes eld +\end_inset + +expanded +\begin_inset Quotes erd +\end_inset + + bit in the header to note entries that have previously expanded, and allocating + more space for them. +\end_layout + +\begin_layout Standard +There are is a spectrum of possible solutions for external fragmentation: + one is to use a fragmentation-avoiding allocation strategy such as best-fit + address-order allocator. + The other end of the spectrum would be to use a bump allocator (very fast + and simple) and simply repack the file when we reach the end. +\end_layout + +\begin_layout Standard +There are three problems with efficient fragmentation-avoiding allocators: + they are non-trivial, they tend to use a single free list for each size, + and there's no evidence that tdb allocation patterns will match those recorded + for general allocators (though it seems likely). +\end_layout + +\begin_layout Standard +Thus we don't spend too much effort on external fragmentation; we will be + no worse than the current code if we need to repack on occasion. + More effort is spent on reducing freelist contention, and reducing overhead. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:Records-Incur-A" + +\end_inset + +Records Incur A 28-Byte Overhead +\end_layout + +\begin_layout Standard +Each TDB record has a header as follows: +\end_layout + +\begin_layout LyX-Code +struct tdb_record { +\end_layout + +\begin_layout LyX-Code + tdb_off_t next; /* offset of the next record in the list */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t rec_len; /* total byte length of record */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t key_len; /* byte length of key */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t data_len; /* byte length of data */ +\end_layout + +\begin_layout LyX-Code + uint32_t full_hash; /* the full 32 bit hash of the key */ +\end_layout + +\begin_layout LyX-Code + uint32_t magic; /* try to catch errors */ +\end_layout + +\begin_layout LyX-Code + /* the following union is implied: +\end_layout + +\begin_layout LyX-Code + union { +\end_layout + +\begin_layout LyX-Code + char record[rec_len]; +\end_layout + +\begin_layout LyX-Code + struct { +\end_layout + +\begin_layout LyX-Code + char key[key_len]; +\end_layout + +\begin_layout LyX-Code + char data[data_len]; +\end_layout + +\begin_layout LyX-Code + } +\end_layout + +\begin_layout LyX-Code + uint32_t totalsize; (tailer) +\end_layout + +\begin_layout LyX-Code + } +\end_layout + +\begin_layout LyX-Code + */ +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +Naively, this would double to a 56-byte overhead on a 64 bit implementation. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We can use various techniques to reduce this for an allocated block: +\end_layout + +\begin_layout Enumerate +The 'next' pointer is not required, as we are using a flat hash table. +\end_layout + +\begin_layout Enumerate +'rec_len' can instead be expressed as an addition to key_len and data_len + (it accounts for wasted or overallocated length in the record). + Since the record length is always a multiple of 8, we can conveniently + fit it in 32 bits (representing up to 35 bits). +\end_layout + +\begin_layout Enumerate +'key_len' and 'data_len' can be reduced. + I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine + the two into one 64-bit field and using a 5 bit value which indicates at + what bit to divide the two. + Keys are unlikely to scale as fast as data, so I'm assuming a maximum key + size of 32 bits. +\end_layout + +\begin_layout Enumerate +'full_hash' is used to avoid a memcmp on the +\begin_inset Quotes eld +\end_inset + +miss +\begin_inset Quotes erd +\end_inset + + case, but this is diminishing returns after a handful of bits (at 10 bits, + it reduces 99.9% of false memcmp). + As an aside, as the lower bits are already incorporated in the hash table + resolution, the upper bits should be used here. + Note that it's not clear that these bits will be a win, given the extra + bits in the hash table itself (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Hash-Size-Solution" + +\end_inset + +). +\end_layout + +\begin_layout Enumerate +'magic' does not need to be enlarged: it currently reflects one of 5 values + (used, free, dead, recovery, and unused_recovery). + It is useful for quick sanity checking however, and should not be eliminated. +\end_layout + +\begin_layout Enumerate +'tailer' is only used to coalesce free blocks (so a block to the right can + find the header to check if this block is free). + This can be replaced by a single 'free' bit in the header of the following + block (and the tailer only exists in free blocks). +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +This technique from Thomas Standish. + Data Structure Techniques. + Addison-Wesley, Reading, Massachusetts, 1980. +\end_layout + +\end_inset + + The current proposed coalescing algorithm doesn't need this, however. +\end_layout + +\begin_layout Standard +This produces a 16 byte used header like this: +\end_layout + +\begin_layout LyX-Code +struct tdb_used_record { +\end_layout + +\begin_layout LyX-Code + uint32_t used_magic : 16, +\end_layout + +\begin_layout LyX-Code + +\end_layout + +\begin_layout LyX-Code + key_data_divide: 5, +\end_layout + +\begin_layout LyX-Code + top_hash: 11; +\end_layout + +\begin_layout LyX-Code + uint32_t extra_octets; +\end_layout + +\begin_layout LyX-Code + uint64_t key_and_data_len; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +And a free record like this: +\end_layout + +\begin_layout LyX-Code +struct tdb_free_record { +\end_layout + +\begin_layout LyX-Code + uint64_t free_magic: 8, +\end_layout + +\begin_layout LyX-Code + prev : 56; +\end_layout + +\begin_layout LyX-Code + +\end_layout + +\begin_layout LyX-Code + uint64_t free_table: 8, +\end_layout + +\begin_layout LyX-Code + total_length : 56 +\end_layout + +\begin_layout LyX-Code + uint64_t next;; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +Note that by limiting valid offsets to 56 bits, we can pack everything we + need into 3 64-byte words, meaning our minimum record size is 8 bytes. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Transaction Commit Requires 4 fdatasync +\end_layout + +\begin_layout Standard +The current transaction algorithm is: +\end_layout + +\begin_layout Enumerate +write_recovery_data(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +write_recovery_header(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +overwrite_with_new_data(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +remove_recovery_header(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Standard +On current ext3, each sync flushes all data to disk, so the next 3 syncs + are relatively expensive. + But this could become a performance bottleneck on other filesystems such + as ext4. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Neil Brown points out that this is overzealous, and only one sync is needed: +\end_layout + +\begin_layout Enumerate +Bundle the recovery data, a transaction counter and a strong checksum of + the new data. +\end_layout + +\begin_layout Enumerate +Strong checksum that whole bundle. +\end_layout + +\begin_layout Enumerate +Store the bundle in the database. +\end_layout + +\begin_layout Enumerate +Overwrite the oldest of the two recovery pointers in the header (identified + using the transaction counter) with the offset of this bundle. +\end_layout + +\begin_layout Enumerate +sync. +\end_layout + +\begin_layout Enumerate +Write the new data to the file. +\end_layout + +\begin_layout Standard +Checking for recovery means identifying the latest bundle with a valid checksum + and using the new data checksum to ensure that it has been applied. + This is more expensive than the current check, but need only be done at + open. + For running databases, a separate header field can be used to indicate + a transaction in progress; we need only check for recovery if this is set. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:TDB-Does-Not" + +\end_inset + +TDB Does Not Have Snapshot Support +\end_layout + +\begin_layout Subsubsection +Proposed SolutionNone. + At some point you say +\begin_inset Quotes eld +\end_inset + +use a real database +\begin_inset Quotes erd +\end_inset + + (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). +\end_layout + +\begin_layout Standard +But as a thought experiment, if we implemented transactions to only overwrite + free entries (this is tricky: there must not be a header in each entry + which indicates whether it is free, but use of presence in metadata elsewhere), + and a pointer to the hash table, we could create an entirely new commit + without destroying existing data. + Then it would be easy to implement snapshots in a similar way. +\end_layout + +\begin_layout Standard +This would not allow arbitrary changes to the database, such as tdb_repack + does, and would require more space (since we have to preserve the current + and future entries at once). + If we used hash trees rather than one big hash table, we might only have + to rewrite some sections of the hash, too. +\end_layout + +\begin_layout Standard +We could then implement snapshots using a similar method, using multiple + different hash tables/free tables. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +Transactions Cannot Operate in Parallel +\end_layout + +\begin_layout Standard +This would be useless for ldb, as it hits the index records with just about + every update. + It would add significant complexity in resolving clashes, and cause the + all transaction callers to write their code to loop in the case where the + transactions spuriously failed. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). + We could solve a small part of the problem by providing read-only transactions. + These would allow one write transaction to begin, but it could not commit + until all r/o transactions are done. + This would require a new RO_TRANSACTION_LOCK, which would be upgraded on + commit. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +Default Hash Function Is Suboptimal +\end_layout + +\begin_layout Standard +The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially + if we expand it to 64 bits), and works best when the hash bucket size is + a prime number (which also means a slow modulus). + In addition, it is highly predictable which could potentially lead to a + Denial of Service attack in some TDB uses. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The Jenkins lookup3 hash +\begin_inset Foot +status open + +\begin_layout Plain Layout +http://burtleburtle.net/bob/c/lookup3.c +\end_layout + +\end_inset + + is a fast and superbly-mixing hash. + It's used by the Linux kernel and almost everything else. + This has the particular properties that it takes an initial seed, and produces + two 32 bit hash numbers, which we can combine into a 64-bit hash. +\end_layout + +\begin_layout Standard +The seed should be created at tdb-creation time from some random source, + and placed in the header. + This is far from foolproof, but adds a little bit of protection against + hash bombing. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "Reliable-Traversal-Adds" + +\end_inset + +Reliable Traversal Adds Complexity +\end_layout + +\begin_layout Standard +We lock a record during traversal iteration, and try to grab that lock in + the delete code. + If that grab on delete fails, we simply mark it deleted and continue onwards; + traversal checks for this condition and does the delete when it moves off + the record. +\end_layout + +\begin_layout Standard +If traversal terminates, the dead record may be left indefinitely. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove reliability guarantees; see +\begin_inset CommandInset ref +LatexCommand ref +reference "traverse-Proposed-Solution" + +\end_inset + +. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Fcntl Locking Adds Overhead +\end_layout + +\begin_layout Standard +Placing a fcntl lock means a system call, as does removing one. + This is actually one reason why transactions can be faster (everything + is locked once at transaction start). + In the uncontended case, this overhead can theoretically be eliminated. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. +\end_layout + +\begin_layout Standard +We tried this before with spinlock support, in the early days of TDB, and + it didn't make much difference except in manufactured benchmarks. +\end_layout + +\begin_layout Standard +We could use spinlocks (with futex kernel support under Linux), but it means + that we lose automatic cleanup when a process dies with a lock. + There is a method of auto-cleanup under Linux, but it's not supported by + other operating systems. + We could reintroduce a clear-if-first-style lock and sweep for dead futexes + on open, but that wouldn't help the normal case of one concurrent opener + dying. + Increasingly elaborate repair schemes could be considered, but they require + an ABI change (everyone must use them) anyway, so there's no need to do + this at the same time as everything else. +\end_layout + +\begin_layout Subsection +Some Transactions Don't Require Durability +\end_layout + +\begin_layout Standard +Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast) + usage, and occasionally empties the results into a transactional TDB. + This kind of usage prioritizes performance over durability: as long as + we are consistent, data can be lost. +\end_layout + +\begin_layout Standard +This would be more neatly implemented inside tdb: a +\begin_inset Quotes eld +\end_inset + +soft +\begin_inset Quotes erd +\end_inset + + transaction commit (ie. + syncless) which meant that data may be reverted on a crash. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. +\end_layout + +\begin_layout Standard +Unfortunately any transaction scheme which overwrites old data requires + a sync before that overwrite to avoid the possibility of corruption. +\end_layout + +\begin_layout Standard +It seems possible to use a scheme similar to that described in +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Does-Not" + +\end_inset + +,where transactions are committed without overwriting existing data, and + an array of top-level pointers were available in the header. + If the transaction is +\begin_inset Quotes eld +\end_inset + +soft +\begin_inset Quotes erd +\end_inset + + then we would not need a sync at all: existing processes would pick up + the new hash table and free list and work with that. +\end_layout + +\begin_layout Standard +At some later point, a sync would allow recovery of the old data into the + free lists (perhaps when the array of top-level pointers filled). + On crash, tdb_open() would examine the array of top levels, and apply the + transactions until it encountered an invalid checksum. +\end_layout + +\begin_layout Subsection +Tracing Is Fragile, Replay Is External +\end_layout + +\begin_layout Standard +The current TDB has compile-time-enabled tracing code, but it often breaks + as it is not enabled by default. + In a similar way, the ctdb code has an external wrapper which does replay + tracing so it can coordinate cluster-wide transactions. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "replay-attribute" + +\end_inset + + +\end_layout + +\begin_layout Standard +Tridge points out that an attribute can be later added to tdb_open (see + +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) to provide replay/trace hooks, which could become the basis for this and + future parallel transactions and snapshot support. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\end_body +\end_document diff --git a/lib/tdb2/doc/design.lyx,v b/lib/tdb2/doc/design.lyx,v new file mode 100644 index 0000000000..13e6387f7f --- /dev/null +++ b/lib/tdb2/doc/design.lyx,v @@ -0,0 +1,4679 @@ +head 1.13; +access; +symbols; +locks; strict; +comment @# @; + + +1.13 +date 2011.03.01.11.46.54; author rusty; state Exp; +branches; +next 1.12; + +1.12 +date 2010.12.01.12.20.49; author rusty; state Exp; +branches; +next 1.11; + +1.11 +date 2010.12.01.11.55.20; author rusty; state Exp; +branches; +next 1.10; + +1.10 +date 2010.09.14.00.33.57; author rusty; state Exp; +branches; +next 1.9; + +1.9 +date 2010.09.09.07.25.12; author rusty; state Exp; +branches; +next 1.8; + +1.8 +date 2010.09.02.02.29.05; author rusty; state Exp; +branches; +next 1.7; + +1.7 +date 2010.09.01.10.58.12; author rusty; state Exp; +branches; +next 1.6; + +1.6 +date 2010.08.02.00.21.43; author rusty; state Exp; +branches; +next 1.5; + +1.5 +date 2010.08.02.00.21.16; author rusty; state Exp; +branches; +next 1.4; + +1.4 +date 2010.05.10.13.09.11; author rusty; state Exp; +branches; +next 1.3; + +1.3 +date 2010.05.10.11.58.37; author rusty; state Exp; +branches; +next 1.2; + +1.2 +date 2010.05.10.05.35.13; author rusty; state Exp; +branches; +next 1.1; + +1.1 +date 2010.05.04.02.29.16; author rusty; state Exp; +branches; +next ; + + +desc +@First draft +@ + + +1.13 +log +@Thread-safe API +@ +text +@#LyX 1.6.7 created this file. For more info see http://www.lyx.org/ +\lyxformat 345 +\begin_document +\begin_header +\textclass article +\use_default_options true +\language english +\inputencoding auto +\font_roman default +\font_sans default +\font_typewriter default +\font_default_family default +\font_sc false +\font_osf false +\font_sf_scale 100 +\font_tt_scale 100 + +\graphics default +\paperfontsize default +\use_hyperref false +\papersize default +\use_geometry false +\use_amsmath 1 +\use_esint 1 +\cite_engine basic +\use_bibtopic false +\paperorientation portrait +\secnumdepth 3 +\tocdepth 3 +\paragraph_separation indent +\defskip medskip +\quotes_language english +\papercolumns 1 +\papersides 1 +\paperpagestyle default +\tracking_changes true +\output_changes true +\author "Rusty Russell,,," +\author "" +\end_header + +\begin_body + +\begin_layout Title +TDB2: A Redesigning The Trivial DataBase +\end_layout + +\begin_layout Author +Rusty Russell, IBM Corporation +\end_layout + +\begin_layout Date +1-December-2010 +\end_layout + +\begin_layout Abstract +The Trivial DataBase on-disk format is 32 bits; with usage cases heading + towards the 4G limit, that must change. + This required breakage provides an opportunity to revisit TDB's other design + decisions and reassess them. +\end_layout + +\begin_layout Section +Introduction +\end_layout + +\begin_layout Standard +The Trivial DataBase was originally written by Andrew Tridgell as a simple + key/data pair storage system with the same API as dbm, but allowing multiple + readers and writers while being small enough (< 1000 lines of C) to include + in SAMBA. + The simple design created in 1999 has proven surprisingly robust and performant +, used in Samba versions 3 and 4 as well as numerous other projects. + Its useful life was greatly increased by the (backwards-compatible!) addition + of transaction support in 2005. +\end_layout + +\begin_layout Standard +The wider variety and greater demands of TDB-using code has lead to some + organic growth of the API, as well as some compromises on the implementation. + None of these, by themselves, are seen as show-stoppers, but the cumulative + effect is to a loss of elegance over the initial, simple TDB implementation. + Here is a table of the approximate number of lines of implementation code + and number of API functions at the end of each year: +\end_layout + +\begin_layout Standard +\begin_inset Tabular +<lyxtabular version="3" rows="12" columns="3"> +<features> +<column alignment="center" valignment="top" width="0"> +<column alignment="center" valignment="top" width="0"> +<column alignment="center" valignment="top" width="0"> +<row> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +Year End +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +API Functions +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +Lines of C Code Implementation +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1999 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +13 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1195 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2000 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +24 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +1725 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2001 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +32 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2228 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2002 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +35 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2481 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2003 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +35 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2552 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2004 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +40 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2584 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2005 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +38 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2647 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2006 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +52 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +3754 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2007 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +66 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +4398 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2008 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +71 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +4768 +\end_layout + +\end_inset +</cell> +</row> +<row> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +2009 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +73 +\end_layout + +\end_inset +</cell> +<cell alignment="center" valignment="top" topline="true" bottomline="true" leftline="true" rightline="true" usebox="none"> +\begin_inset Text + +\begin_layout Plain Layout +5715 +\end_layout + +\end_inset +</cell> +</row> +</lyxtabular> + +\end_inset + + +\end_layout + +\begin_layout Standard +This review is an attempt to catalog and address all the known issues with + TDB and create solutions which address the problems without significantly + increasing complexity; all involved are far too aware of the dangers of + second system syndrome in rewriting a successful project like this. +\end_layout + +\begin_layout Section +API Issues +\end_layout + +\begin_layout Subsection +tdb_open_ex Is Not Expandable +\end_layout + +\begin_layout Standard +The tdb_open() call was expanded to tdb_open_ex(), which added an optional + hashing function and an optional logging function argument. + Additional arguments to open would require the introduction of a tdb_open_ex2 + call etc. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "attributes" + +\end_inset + + +\end_layout + +\begin_layout Standard +tdb_open() will take a linked-list of attributes: +\end_layout + +\begin_layout LyX-Code +enum tdb_attribute { +\end_layout + +\begin_layout LyX-Code + TDB_ATTRIBUTE_LOG = 0, +\end_layout + +\begin_layout LyX-Code + TDB_ATTRIBUTE_HASH = 1 +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_base { +\end_layout + +\begin_layout LyX-Code + enum tdb_attribute attr; +\end_layout + +\begin_layout LyX-Code + union tdb_attribute *next; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_log { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG */ +\end_layout + +\begin_layout LyX-Code + tdb_log_func log_fn; +\end_layout + +\begin_layout LyX-Code + void *log_private; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +struct tdb_attribute_hash { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH */ +\end_layout + +\begin_layout LyX-Code + tdb_hash_func hash_fn; +\end_layout + +\begin_layout LyX-Code + void *hash_private; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout LyX-Code +union tdb_attribute { +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_base base; +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_log log; +\end_layout + +\begin_layout LyX-Code + struct tdb_attribute_hash hash; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +This allows future attributes to be added, even if this expands the size + of the union. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +tdb_traverse Makes Impossible Guarantees +\end_layout + +\begin_layout Standard +tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, and it + was thought that it was important to guarantee that all records which exist + at the start and end of the traversal would be included, and no record + would be included twice. +\end_layout + +\begin_layout Standard +This adds complexity (see +\begin_inset CommandInset ref +LatexCommand ref +reference "Reliable-Traversal-Adds" + +\end_inset + +) and does not work anyway for records which are altered (in particular, + those which are expanded may be effectively deleted and re-added behind + the traversal). +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "traverse-Proposed-Solution" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +Abandon the guarantee. + You will see every record if no changes occur during your traversal, otherwise + you will see some subset. + You can prevent changes by using a transaction or the locking API. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. + Delete-during-traverse will still delete every record, too (assuming no + other changes). +\end_layout + +\begin_layout Subsection +Nesting of Transactions Is Fraught +\end_layout + +\begin_layout Standard +TDB has alternated between allowing nested transactions and not allowing + them. + Various paths in the Samba codebase assume that transactions will nest, + and in a sense they can: the operation is only committed to disk when the + outer transaction is committed. + There are two problems, however: +\end_layout + +\begin_layout Enumerate +Canceling the inner transaction will cause the outer transaction commit + to fail, and will not undo any operations since the inner transaction began. + This problem is soluble with some additional internal code. +\end_layout + +\begin_layout Enumerate +An inner transaction commit can be cancelled by the outer transaction. + This is desirable in the way which Samba's database initialization code + uses transactions, but could be a surprise to any users expecting a successful + transaction commit to expose changes to others. +\end_layout + +\begin_layout Standard +The current solution is to specify the behavior at tdb_open(), with the + default currently that nested transactions are allowed. + This flag can also be changed at runtime. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Given the usage patterns, it seems that the +\begin_inset Quotes eld +\end_inset + +least-surprise +\begin_inset Quotes erd +\end_inset + + behavior of disallowing nested transactions should become the default. + Additionally, it seems the outer transaction is the only code which knows + whether inner transactions should be allowed, so a flag to indicate this + could be added to tdb_transaction_start. + However, this behavior can be simulated with a wrapper which uses tdb_add_flags +() and tdb_remove_flags(), so the API should not be expanded for this relatively +-obscure case. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard + +\change_deleted 0 1298979572 +Incomplete; nesting flag is still defined as per tdb1. +\change_inserted 0 1298979584 +Complete; the nesting flag has been removed. +\change_unchanged + +\end_layout + +\begin_layout Subsection +Incorrect Hash Function is Not Detected +\end_layout + +\begin_layout Standard +tdb_open_ex() allows the calling code to specify a different hash function + to use, but does not check that all other processes accessing this tdb + are using the same hash function. + The result is that records are missing from tdb_fetch(). +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The header should contain an example hash result (eg. + the hash of 0xdeadbeef), and tdb_open_ex() should check that the given + hash function produces the same answer, or fail the tdb_open call. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +tdb_set_max_dead/TDB_VOLATILE Expose Implementation +\end_layout + +\begin_layout Standard +In response to scalability issues with the free list ( +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB-Freelist-Is" + +\end_inset + +) two API workarounds have been incorporated in TDB: tdb_set_max_dead() + and the TDB_VOLATILE flag to tdb_open. + The latter actually calls the former with an argument of +\begin_inset Quotes eld +\end_inset + +5 +\begin_inset Quotes erd +\end_inset + +. +\end_layout + +\begin_layout Standard +This code allows deleted records to accumulate without putting them in the + free list. + On delete we iterate through each chain and free them in a batch if there + are more than max_dead entries. + These are never otherwise recycled except as a side-effect of a tdb_repack. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +With the scalability problems of the freelist solved, this API can be removed. + The TDB_VOLATILE flag may still be useful as a hint that store and delete + of records will be at least as common as fetch in order to allow some internal + tuning, but initially will become a no-op. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. + TDB_VOLATILE still defined, but implementation should fail on unknown flags + to be future-proof. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB-Files-Cannot" + +\end_inset + +TDB Files Cannot Be Opened Multiple Times In The Same Process +\end_layout + +\begin_layout Standard +No process can open the same TDB twice; we check and disallow it. + This is an unfortunate side-effect of fcntl locks, which operate on a per-file + rather than per-file-descriptor basis, and do not nest. + Thus, closing any file descriptor on a file clears all the locks obtained + by this process, even if they were placed using a different file descriptor! +\end_layout + +\begin_layout Standard +Note that even if this were solved, deadlock could occur if operations were + nested: this is a more manageable programming error in most cases. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We could lobby POSIX to fix the perverse rules, or at least lobby Linux + to violate them so that the most common implementation does not have this + restriction. + This would be a generally good idea for other fcntl lock users. +\end_layout + +\begin_layout Standard +Samba uses a wrapper which hands out the same tdb_context to multiple callers + if this happens, and does simple reference counting. + We should do this inside the tdb library, which already emulates lock nesting + internally; it would need to recognize when deadlock occurs within a single + process. + This would create a new failure mode for tdb operations (while we currently + handle locking failures, they are impossible in normal use and a process + encountering them can do little but give up). +\end_layout + +\begin_layout Standard +I do not see benefit in an additional tdb_open flag to indicate whether + re-opening is allowed, as though there may be some benefit to adding a + call to detect when a tdb_context is shared, to allow other to create such + an API. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +TDB API Is Not POSIX Thread-safe +\end_layout + +\begin_layout Standard +The TDB API uses an error code which can be queried after an operation to + determine what went wrong. + This programming model does not work with threads, unless specific additional + guarantees are given by the implementation. + In addition, even otherwise-independent threads cannot open the same TDB + (as in +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB-Files-Cannot" + +\end_inset + +). +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Reachitecting the API to include a tdb_errcode pointer would be a great + deal of churn +\change_inserted 0 1298979557 +, but fortunately most functions return 0 on success and -1 on error: we + can change these to return 0 on success and a negative error code on error, + and the API remains similar to previous. + The tdb_fetch, tdb_firstkey and tdb_nextkey functions need to take a TDB_DATA + pointer and return an error code. + It is also simpler to have tdb_nextkey replace its key argument in place, + freeing up any old .dptr. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1298979438 +; we are better to guarantee that the tdb_errcode is per-thread so the current + programming model can be maintained. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1298979438 +This requires dynamic per-thread allocations, which is awkward with POSIX + threads (pthread_key_create space is limited and we cannot simply allocate + a key for every TDB). +\change_unchanged + +\end_layout + +\begin_layout Standard +Internal locking is required to make sure that fcntl locks do not overlap + between threads, and also that the global list of tdbs is maintained. +\end_layout + +\begin_layout Standard +The aim is that building tdb with -DTDB_PTHREAD will result in a pthread-safe + version of the library, and otherwise no overhead will exist. + Alternatively, a hooking mechanism similar to that proposed for +\begin_inset CommandInset ref +LatexCommand ref +reference "Proposed-Solution-locking-hook" + +\end_inset + + could be used to enable pthread locking at runtime. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete +\change_inserted 0 1298979681 +; API has been changed but thread safety has not been implemented. +\change_deleted 0 1298979669 +. +\change_unchanged + +\end_layout + +\begin_layout Subsection +*_nonblock Functions And *_mark Functions Expose Implementation +\end_layout + +\begin_layout Standard +CTDB +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +Clustered TDB, see http://ctdb.samba.org +\end_layout + +\end_inset + + wishes to operate on TDB in a non-blocking manner. + This is currently done as follows: +\end_layout + +\begin_layout Enumerate +Call the _nonblock variant of an API function (eg. + tdb_lockall_nonblock). + If this fails: +\end_layout + +\begin_layout Enumerate +Fork a child process, and wait for it to call the normal variant (eg. + tdb_lockall). +\end_layout + +\begin_layout Enumerate +If the child succeeds, call the _mark variant to indicate we already have + the locks (eg. + tdb_lockall_mark). +\end_layout + +\begin_layout Enumerate +Upon completion, tell the child to release the locks (eg. + tdb_unlockall). +\end_layout + +\begin_layout Enumerate +Indicate to tdb that it should consider the locks removed (eg. + tdb_unlockall_mark). +\end_layout + +\begin_layout Standard +There are several issues with this approach. + Firstly, adding two new variants of each function clutters the API for + an obscure use, and so not all functions have three variants. + Secondly, it assumes that all paths of the functions ask for the same locks, + otherwise the parent process will have to get a lock which the child doesn't + have under some circumstances. + I don't believe this is currently the case, but it constrains the implementatio +n. + +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "Proposed-Solution-locking-hook" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +Implement a hook for locking methods, so that the caller can control the + calls to create and remove fcntl locks. + In this scenario, ctdbd would operate as follows: +\end_layout + +\begin_layout Enumerate +Call the normal API function, eg tdb_lockall(). +\end_layout + +\begin_layout Enumerate +When the lock callback comes in, check if the child has the lock. + Initially, this is always false. + If so, return 0. + Otherwise, try to obtain it in non-blocking mode. + If that fails, return EWOULDBLOCK. +\end_layout + +\begin_layout Enumerate +Release locks in the unlock callback as normal. +\end_layout + +\begin_layout Enumerate +If tdb_lockall() fails, see if we recorded a lock failure; if so, call the + child to repeat the operation. +\end_layout + +\begin_layout Enumerate +The child records what locks it obtains, and returns that information to + the parent. +\end_layout + +\begin_layout Enumerate +When the child has succeeded, goto 1. +\end_layout + +\begin_layout Standard +This is flexible enough to handle any potential locking scenario, even when + lock requirements change. + It can be optimized so that the parent does not release locks, just tells + the child which locks it doesn't need to obtain. +\end_layout + +\begin_layout Standard +It also keeps the complexity out of the API, and in ctdbd where it is needed. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +tdb_chainlock Functions Expose Implementation +\end_layout + +\begin_layout Standard +tdb_chainlock locks some number of records, including the record indicated + by the given key. + This gave atomicity guarantees; no-one can start a transaction, alter, + read or delete that key while the lock is held. +\end_layout + +\begin_layout Standard +It also makes the same guarantee for any other key in the chain, which is + an internal implementation detail and potentially a cause for deadlock. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. + It would be nice to have an explicit single entry lock which effected no + other keys. + Unfortunately, this won't work for an entry which doesn't exist. + Thus while chainlock may be implemented more efficiently for the existing + case, it will still have overlap issues with the non-existing case. + So it is best to keep the current (lack of) guarantee about which records + will be effected to avoid constraining our implementation. +\end_layout + +\begin_layout Subsection +Signal Handling is Not Race-Free +\end_layout + +\begin_layout Standard +The tdb_setalarm_sigptr() call allows the caller's signal handler to indicate + that the tdb locking code should return with a failure, rather than trying + again when a signal is received (and errno == EAGAIN). + This is usually used to implement timeouts. +\end_layout + +\begin_layout Standard +Unfortunately, this does not work in the case where the signal is received + before the tdb code enters the fcntl() call to place the lock: the code + will sleep within the fcntl() code, unaware that the signal wants it to + exit. + In the case of long timeouts, this does not happen in practice. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The locking hooks proposed in +\begin_inset CommandInset ref +LatexCommand ref +reference "Proposed-Solution-locking-hook" + +\end_inset + + would allow the user to decide on whether to fail the lock acquisition + on a signal. + This allows the caller to choose their own compromise: they could narrow + the race by checking immediately before the fcntl call. +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +It may be possible to make this race-free in some implementations by having + the signal handler alter the struct flock to make it invalid. + This will cause the fcntl() lock call to fail with EINVAL if the signal + occurs before the kernel is entered, otherwise EAGAIN. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +The API Uses Gratuitous Typedefs, Capitals +\end_layout + +\begin_layout Standard +typedefs are useful for providing source compatibility when types can differ + across implementations, or arguably in the case of function pointer definitions + which are hard for humans to parse. + Otherwise it is simply obfuscation and pollutes the namespace. +\end_layout + +\begin_layout Standard +Capitalization is usually reserved for compile-time constants and macros. +\end_layout + +\begin_layout Description +TDB_CONTEXT There is no reason to use this over 'struct tdb_context'; the + definition isn't visible to the API user anyway. +\end_layout + +\begin_layout Description +TDB_DATA There is no reason to use this over struct TDB_DATA; the struct + needs to be understood by the API user. +\end_layout + +\begin_layout Description +struct +\begin_inset space ~ +\end_inset + +TDB_DATA This would normally be called 'struct tdb_data'. +\end_layout + +\begin_layout Description +enum +\begin_inset space ~ +\end_inset + +TDB_ERROR Similarly, this would normally be enum tdb_error. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. + Introducing lower case variants would please pedants like myself, but if + it were done the existing ones should be kept. + There is little point forcing a purely cosmetic change upon tdb users. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "tdb_log_func-Doesnt-Take" + +\end_inset + +tdb_log_func Doesn't Take The Private Pointer +\end_layout + +\begin_layout Standard +For API compatibility reasons, the logging function needs to call tdb_get_loggin +g_private() to retrieve the pointer registered by the tdb_open_ex for logging. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +It should simply take an extra argument, since we are prepared to break + the API/ABI. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Various Callback Functions Are Not Typesafe +\end_layout + +\begin_layout Standard +The callback functions in tdb_set_logging_function (after +\begin_inset CommandInset ref +LatexCommand ref +reference "tdb_log_func-Doesnt-Take" + +\end_inset + + is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read and tdb_check + all take void * and must internally convert it to the argument type they + were expecting. +\end_layout + +\begin_layout Standard +If this type changes, the compiler will not produce warnings on the callers, + since it only sees void *. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +With careful use of macros, we can create callback functions which give + a warning when used on gcc and the types of the callback and its private + argument differ. + Unsupported compilers will not give a warning, which is no worse than now. + In addition, the callbacks become clearer, as they need not use void * + for their parameter. +\end_layout + +\begin_layout Standard +See CCAN's typesafe_cb module at http://ccan.ozlabs.org/info/typesafe_cb.html +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, tdb_reopen_all Problematic +\end_layout + +\begin_layout Standard +The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB file should + be cleared if the caller discovers it is the only process with the TDB + open. + However, if any caller does not specify TDB_CLEAR_IF_FIRST it will not + be detected, so will have the TDB erased underneath them (usually resulting + in a crash). +\end_layout + +\begin_layout Standard +There is a similar issue on fork(); if the parent exits (or otherwise closes + the tdb) before the child calls tdb_reopen_all() to establish the lock + used to indicate the TDB is opened by someone, a TDB_CLEAR_IF_FIRST opener + at that moment will believe it alone has opened the TDB and will erase + it. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove TDB_CLEAR_IF_FIRST. + Other workarounds are possible, but see +\begin_inset CommandInset ref +LatexCommand ref +reference "TDB_CLEAR_IF_FIRST-Imposes-Performance" + +\end_inset + +. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard + +\change_deleted 0 1298979699 +Incomplete, TDB_CLEAR_IF_FIRST still defined, but not implemented. +\change_inserted 0 1298979700 +Complete. +\change_unchanged + +\end_layout + +\begin_layout Subsection +Extending The Header Is Difficult +\end_layout + +\begin_layout Standard +We have reserved (zeroed) words in the TDB header, which can be used for + future features. + If the future features are compulsory, the version number must be updated + to prevent old code from accessing the database. + But if the future feature is optional, we have no way of telling if older + code is accessing the database or not. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The header should contain a +\begin_inset Quotes eld +\end_inset + +format variant +\begin_inset Quotes erd +\end_inset + + value (64-bit). + This is divided into two 32-bit parts: +\end_layout + +\begin_layout Enumerate +The lower part reflects the format variant understood by code accessing + the database. +\end_layout + +\begin_layout Enumerate +The upper part reflects the format variant you must understand to write + to the database (otherwise you can only open for reading). +\end_layout + +\begin_layout Standard +The latter field can only be written at creation time, the former should + be written under the OPEN_LOCK when opening the database for writing, if + the variant of the code is lower than the current lowest variant. +\end_layout + +\begin_layout Standard +This should allow backwards-compatible features to be added, and detection + if older code (which doesn't understand the feature) writes to the database. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +Record Headers Are Not Expandible +\end_layout + +\begin_layout Standard +If we later want to add (say) checksums on keys and data, it would require + another format change, which we'd like to avoid. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We often have extra padding at the tail of a record. + If we ensure that the first byte (if any) of this padding is zero, we will + have a way for future changes to detect code which doesn't understand a + new format: the new code would write (say) a 1 at the tail, and thus if + there is no tail or the first byte is 0, we would know the extension is + not present on that record. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Incomplete. +\end_layout + +\begin_layout Subsection +TDB Does Not Use Talloc +\end_layout + +\begin_layout Standard +Many users of TDB (particularly Samba) use the talloc allocator, and thus + have to wrap TDB in a talloc context to use it conveniently. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The allocation within TDB is not complicated enough to justify the use of + talloc, and I am reluctant to force another (excellent) library on TDB + users. + Nonetheless a compromise is possible. + An attribute (see +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) can be added later to tdb_open() to provide an alternate allocation mechanism, + specifically for talloc but usable by any other allocator (which would + ignore the +\begin_inset Quotes eld +\end_inset + +context +\begin_inset Quotes erd +\end_inset + + argument). +\end_layout + +\begin_layout Standard +This would form a talloc heirarchy as expected, but the caller would still + have to attach a destructor to the tdb context returned from tdb_open to + close it. + All TDB_DATA fields would be children of the tdb_context, and the caller + would still have to manage them (using talloc_free() or talloc_steal()). +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Section +Performance And Scalability Issues +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB_CLEAR_IF_FIRST-Imposes-Performance" + +\end_inset + +TDB_CLEAR_IF_FIRST Imposes Performance Penalty +\end_layout + +\begin_layout Standard +When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is placed at offset + 4 (aka. + the ACTIVE_LOCK). + While these locks never conflict in normal tdb usage, they do add substantial + overhead for most fcntl lock implementations when the kernel scans to detect + if a lock conflict exists. + This is often a single linked list, making the time to acquire and release + a fcntl lock O(N) where N is the number of processes with the TDB open, + not the number actually doing work. +\end_layout + +\begin_layout Standard +In a Samba server it is common to have huge numbers of clients sitting idle, + and thus they have weaned themselves off the TDB_CLEAR_IF_FIRST flag. +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +There is a flag to tdb_reopen_all() which is used for this optimization: + if the parent process will outlive the child, the child does not need the + ACTIVE_LOCK. + This is a workaround for this very performance issue. +\end_layout + +\end_inset + + +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove the flag. + It was a neat idea, but even trivial servers tend to know when they are + initializing for the first time and can simply unlink the old tdb at that + point. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard + +\change_deleted 0 1298979837 +Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing. +\change_inserted 0 1298979837 +Complete. +\change_unchanged + +\end_layout + +\begin_layout Subsection +TDB Files Have a 4G Limit +\end_layout + +\begin_layout Standard +This seems to be becoming an issue (so much for +\begin_inset Quotes eld +\end_inset + +trivial +\begin_inset Quotes erd +\end_inset + +!), particularly for ldb. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +A new, incompatible TDB format which uses 64 bit offsets internally rather + than 32 bit as now. + For simplicity of endian conversion (which TDB does on the fly if required), + all values will be 64 bit on disk. + In practice, some upper bits may be used for other purposes, but at least + 56 bits will be available for file offsets. +\end_layout + +\begin_layout Standard +tdb_open() will automatically detect the old version, and even create them + if TDB_VERSION6 is specified to tdb_open. +\end_layout + +\begin_layout Standard +32 bit processes will still be able to access TDBs larger than 4G (assuming + that their off_t allows them to seek to 64 bits), they will gracefully + fall back as they fail to mmap. + This can happen already with large TDBs. +\end_layout + +\begin_layout Standard +Old versions of tdb will fail to open the new TDB files (since 28 August + 2009, commit 398d0c29290: prior to that any unrecognized file format would + be erased and initialized as a fresh tdb!) +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +TDB Records Have a 4G Limit +\end_layout + +\begin_layout Standard +This has not been a reported problem, and the API uses size_t which can + be 64 bit on 64 bit platforms. + However, other limits may have made such an issue moot. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Record sizes will be 64 bit, with an error returned on 32 bit platforms + which try to access such records (the current implementation would return + TDB_ERR_OOM in a similar case). + It seems unlikely that 32 bit keys will be a limitation, so the implementation + may not support this (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Records-Incur-A" + +\end_inset + +). +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Hash Size Is Determined At TDB Creation Time +\end_layout + +\begin_layout Standard +TDB contains a number of hash chains in the header; the number is specified + at creation time, and defaults to 131. + This is such a bottleneck on large databases (as each hash chain gets quite + long), that LDB uses 10,000 for this hash. + In general it is impossible to know what the 'right' answer is at database + creation time. +\end_layout + +\begin_layout Subsubsection +\begin_inset CommandInset label +LatexCommand label +name "sub:Hash-Size-Solution" + +\end_inset + +Proposed Solution +\end_layout + +\begin_layout Standard +After comprehensive performance testing on various scalable hash variants +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 This was annoying + because I was previously convinced that an expanding tree of hashes would + be very close to optimal. +\end_layout + +\end_inset + +, it became clear that it is hard to beat a straight linear hash table which + doubles in size when it reaches saturation. + Unfortunately, altering the hash table introduces serious locking complications +: the entire hash table needs to be locked to enlarge the hash table, and + others might be holding locks. + Particularly insidious are insertions done under tdb_chainlock. +\end_layout + +\begin_layout Standard +Thus an expanding layered hash will be used: an array of hash groups, with + each hash group exploding into pointers to lower hash groups once it fills, + turning into a hash tree. + This has implications for locking: we must lock the entire group in case + we need to expand it, yet we don't know how deep the tree is at that point. +\end_layout + +\begin_layout Standard +Note that bits from the hash table entries should be stolen to hold more + hash bits to reduce the penalty of collisions. + We can use the otherwise-unused lower 3 bits. + If we limit the size of the database to 64 exabytes, we can use the top + 8 bits of the hash entry as well. + These 11 bits would reduce false positives down to 1 in 2000 which is more + than we need: we can use one of the bits to indicate that the extra hash + bits are valid. + This means we can choose not to re-hash all entries when we expand a hash + group; simply use the next bits we need and mark them invalid. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "TDB-Freelist-Is" + +\end_inset + +TDB Freelist Is Highly Contended +\end_layout + +\begin_layout Standard +TDB uses a single linked list for the free list. + Allocation occurs as follows, using heuristics which have evolved over + time: +\end_layout + +\begin_layout Enumerate +Get the free list lock for this whole operation. +\end_layout + +\begin_layout Enumerate +Multiply length by 1.25, so we always over-allocate by 25%. +\end_layout + +\begin_layout Enumerate +Set the slack multiplier to 1. +\end_layout + +\begin_layout Enumerate +Examine the current freelist entry: if it is > length but < the current + best case, remember it as the best case. +\end_layout + +\begin_layout Enumerate +Multiply the slack multiplier by 1.05. +\end_layout + +\begin_layout Enumerate +If our best fit so far is less than length * slack multiplier, return it. + The slack will be turned into a new free record if it's large enough. +\end_layout + +\begin_layout Enumerate +Otherwise, go onto the next freelist entry. +\end_layout + +\begin_layout Standard +Deleting a record occurs as follows: +\end_layout + +\begin_layout Enumerate +Lock the hash chain for this whole operation. +\end_layout + +\begin_layout Enumerate +Walk the chain to find the record, keeping the prev pointer offset. +\end_layout + +\begin_layout Enumerate +If max_dead is non-zero: +\end_layout + +\begin_deeper +\begin_layout Enumerate +Walk the hash chain again and count the dead records. +\end_layout + +\begin_layout Enumerate +If it's more than max_dead, bulk free all the dead ones (similar to steps + 4 and below, but the lock is only obtained once). +\end_layout + +\begin_layout Enumerate +Simply mark this record as dead and return. + +\end_layout + +\end_deeper +\begin_layout Enumerate +Get the free list lock for the remainder of this operation. +\end_layout + +\begin_layout Enumerate +\begin_inset CommandInset label +LatexCommand label +name "right-merging" + +\end_inset + +Examine the following block to see if it is free; if so, enlarge the current + block and remove that block from the free list. + This was disabled, as removal from the free list was O(entries-in-free-list). +\end_layout + +\begin_layout Enumerate +Examine the preceeding block to see if it is free: for this reason, each + block has a 32-bit tailer which indicates its length. + If it is free, expand it to cover our new block and return. +\end_layout + +\begin_layout Enumerate +Otherwise, prepend ourselves to the free list. +\end_layout + +\begin_layout Standard +Disabling right-merging (step +\begin_inset CommandInset ref +LatexCommand ref +reference "right-merging" + +\end_inset + +) causes fragmentation; the other heuristics proved insufficient to address + this, so the final answer to this was that when we expand the TDB file + inside a transaction commit, we repack the entire tdb. +\end_layout + +\begin_layout Standard +The single list lock limits our allocation rate; due to the other issues + this is not currently seen as a bottleneck. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The first step is to remove all the current heuristics, as they obviously + interact, then examine them once the lock contention is addressed. +\end_layout + +\begin_layout Standard +The free list must be split to reduce contention. + Assuming perfect free merging, we can at most have 1 free list entry for + each entry. + This implies that the number of free lists is related to the size of the + hash table, but as it is rare to walk a large number of free list entries + we can use far fewer, say 1/32 of the number of hash buckets. +\end_layout + +\begin_layout Standard +It seems tempting to try to reuse the hash implementation which we use for + records here, but we have two ways of searching for free entries: for allocatio +n we search by size (and possibly zone) which produces too many clashes + for our hash table to handle well, and for coalescing we search by address. + Thus an array of doubly-linked free lists seems preferable. +\end_layout + +\begin_layout Standard +There are various benefits in using per-size free lists (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Becomes-Fragmented" + +\end_inset + +) but it's not clear this would reduce contention in the common case where + all processes are allocating/freeing the same size. + Thus we almost certainly need to divide in other ways: the most obvious + is to divide the file into zones, and using a free list (or table of free + lists) for each. + This approximates address ordering. +\end_layout + +\begin_layout Standard +Unfortunately it is difficult to know what heuristics should be used to + determine zone sizes, and our transaction code relies on being able to + create a +\begin_inset Quotes eld +\end_inset + +recovery area +\begin_inset Quotes erd +\end_inset + + by simply appending to the file (difficult if it would need to create a + new zone header). + Thus we use a linked-list of free tables; currently we only ever create + one, but if there is more than one we choose one at random to use. + In future we may use heuristics to add new free tables on contention. + We only expand the file when all free tables are exhausted. +\end_layout + +\begin_layout Standard +The basic algorithm is as follows. + Freeing is simple: +\end_layout + +\begin_layout Enumerate +Identify the correct free list. +\end_layout + +\begin_layout Enumerate +Lock the corresponding list. +\end_layout + +\begin_layout Enumerate +Re-check the list (we didn't have a lock, sizes could have changed): relock + if necessary. +\end_layout + +\begin_layout Enumerate +Place the freed entry in the list. +\end_layout + +\begin_layout Standard +Allocation is a little more complicated, as we perform delayed coalescing + at this point: +\end_layout + +\begin_layout Enumerate +Pick a free table; usually the previous one. +\end_layout + +\begin_layout Enumerate +Lock the corresponding list. +\end_layout + +\begin_layout Enumerate +If the top entry is -large enough, remove it from the list and return it. +\end_layout + +\begin_layout Enumerate +Otherwise, coalesce entries in the list.If there was no entry large enough, + unlock the list and try the next largest list +\end_layout + +\begin_layout Enumerate +If no list has an entry which meets our needs, try the next free table. +\end_layout + +\begin_layout Enumerate +If no zone satisfies, expand the file. +\end_layout + +\begin_layout Standard +This optimizes rapid insert/delete of free list entries by not coalescing + them all the time.. + First-fit address ordering ordering seems to be fairly good for keeping + fragmentation low (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Becomes-Fragmented" + +\end_inset + +). + Note that address ordering does not need a tailer to coalesce, though if + we needed one we could have one cheaply: see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Records-Incur-A" + +\end_inset + +. + +\end_layout + +\begin_layout Standard +Each free entry has the free table number in the header: less than 255. + It also contains a doubly-linked list for easy deletion. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:TDB-Becomes-Fragmented" + +\end_inset + +TDB Becomes Fragmented +\end_layout + +\begin_layout Standard +Much of this is a result of allocation strategy +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 ftp://ftp.cs.ute +xas.edu/pub/garbage/malloc/ismm98.ps +\end_layout + +\end_inset + + and deliberate hobbling of coalescing; internal fragmentation (aka overallocati +on) is deliberately set at 25%, and external fragmentation is only cured + by the decision to repack the entire db when a transaction commit needs + to enlarge the file. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The 25% overhead on allocation works in practice for ldb because indexes + tend to expand by one record at a time. + This internal fragmentation can be resolved by having an +\begin_inset Quotes eld +\end_inset + +expanded +\begin_inset Quotes erd +\end_inset + + bit in the header to note entries that have previously expanded, and allocating + more space for them. +\end_layout + +\begin_layout Standard +There are is a spectrum of possible solutions for external fragmentation: + one is to use a fragmentation-avoiding allocation strategy such as best-fit + address-order allocator. + The other end of the spectrum would be to use a bump allocator (very fast + and simple) and simply repack the file when we reach the end. +\end_layout + +\begin_layout Standard +There are three problems with efficient fragmentation-avoiding allocators: + they are non-trivial, they tend to use a single free list for each size, + and there's no evidence that tdb allocation patterns will match those recorded + for general allocators (though it seems likely). +\end_layout + +\begin_layout Standard +Thus we don't spend too much effort on external fragmentation; we will be + no worse than the current code if we need to repack on occasion. + More effort is spent on reducing freelist contention, and reducing overhead. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:Records-Incur-A" + +\end_inset + +Records Incur A 28-Byte Overhead +\end_layout + +\begin_layout Standard +Each TDB record has a header as follows: +\end_layout + +\begin_layout LyX-Code +struct tdb_record { +\end_layout + +\begin_layout LyX-Code + tdb_off_t next; /* offset of the next record in the list */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t rec_len; /* total byte length of record */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t key_len; /* byte length of key */ +\end_layout + +\begin_layout LyX-Code + tdb_len_t data_len; /* byte length of data */ +\end_layout + +\begin_layout LyX-Code + uint32_t full_hash; /* the full 32 bit hash of the key */ +\end_layout + +\begin_layout LyX-Code + uint32_t magic; /* try to catch errors */ +\end_layout + +\begin_layout LyX-Code + /* the following union is implied: +\end_layout + +\begin_layout LyX-Code + union { +\end_layout + +\begin_layout LyX-Code + char record[rec_len]; +\end_layout + +\begin_layout LyX-Code + struct { +\end_layout + +\begin_layout LyX-Code + char key[key_len]; +\end_layout + +\begin_layout LyX-Code + char data[data_len]; +\end_layout + +\begin_layout LyX-Code + } +\end_layout + +\begin_layout LyX-Code + uint32_t totalsize; (tailer) +\end_layout + +\begin_layout LyX-Code + } +\end_layout + +\begin_layout LyX-Code + */ +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +Naively, this would double to a 56-byte overhead on a 64 bit implementation. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +We can use various techniques to reduce this for an allocated block: +\end_layout + +\begin_layout Enumerate +The 'next' pointer is not required, as we are using a flat hash table. +\end_layout + +\begin_layout Enumerate +'rec_len' can instead be expressed as an addition to key_len and data_len + (it accounts for wasted or overallocated length in the record). + Since the record length is always a multiple of 8, we can conveniently + fit it in 32 bits (representing up to 35 bits). +\end_layout + +\begin_layout Enumerate +'key_len' and 'data_len' can be reduced. + I'm unwilling to restrict 'data_len' to 32 bits, but instead we can combine + the two into one 64-bit field and using a 5 bit value which indicates at + what bit to divide the two. + Keys are unlikely to scale as fast as data, so I'm assuming a maximum key + size of 32 bits. +\end_layout + +\begin_layout Enumerate +'full_hash' is used to avoid a memcmp on the +\begin_inset Quotes eld +\end_inset + +miss +\begin_inset Quotes erd +\end_inset + + case, but this is diminishing returns after a handful of bits (at 10 bits, + it reduces 99.9% of false memcmp). + As an aside, as the lower bits are already incorporated in the hash table + resolution, the upper bits should be used here. + Note that it's not clear that these bits will be a win, given the extra + bits in the hash table itself (see +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:Hash-Size-Solution" + +\end_inset + +). +\end_layout + +\begin_layout Enumerate +'magic' does not need to be enlarged: it currently reflects one of 5 values + (used, free, dead, recovery, and unused_recovery). + It is useful for quick sanity checking however, and should not be eliminated. +\end_layout + +\begin_layout Enumerate +'tailer' is only used to coalesce free blocks (so a block to the right can + find the header to check if this block is free). + This can be replaced by a single 'free' bit in the header of the following + block (and the tailer only exists in free blocks). +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +This technique from Thomas Standish. + Data Structure Techniques. + Addison-Wesley, Reading, Massachusetts, 1980. +\end_layout + +\end_inset + + The current proposed coalescing algorithm doesn't need this, however. +\end_layout + +\begin_layout Standard +This produces a 16 byte used header like this: +\end_layout + +\begin_layout LyX-Code +struct tdb_used_record { +\end_layout + +\begin_layout LyX-Code + uint32_t used_magic : 16, +\end_layout + +\begin_layout LyX-Code + +\end_layout + +\begin_layout LyX-Code + key_data_divide: 5, +\end_layout + +\begin_layout LyX-Code + top_hash: 11; +\end_layout + +\begin_layout LyX-Code + uint32_t extra_octets; +\end_layout + +\begin_layout LyX-Code + uint64_t key_and_data_len; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard +And a free record like this: +\end_layout + +\begin_layout LyX-Code +struct tdb_free_record { +\end_layout + +\begin_layout LyX-Code + uint64_t free_magic: 8, +\end_layout + +\begin_layout LyX-Code + prev : 56; +\end_layout + +\begin_layout LyX-Code + +\end_layout + +\begin_layout LyX-Code + uint64_t free_table: 8, +\end_layout + +\begin_layout LyX-Code + total_length : 56 +\end_layout + +\begin_layout LyX-Code + uint64_t next;; +\end_layout + +\begin_layout LyX-Code +}; +\end_layout + +\begin_layout Standard + +\change_deleted 0 1291206079 + +\change_unchanged +Note that by limiting valid offsets to 56 bits, we can pack everything we + need into 3 64-byte words, meaning our minimum record size is 8 bytes. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Transaction Commit Requires 4 fdatasync +\end_layout + +\begin_layout Standard +The current transaction algorithm is: +\end_layout + +\begin_layout Enumerate +write_recovery_data(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +write_recovery_header(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +overwrite_with_new_data(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Enumerate +remove_recovery_header(); +\end_layout + +\begin_layout Enumerate +sync(); +\end_layout + +\begin_layout Standard +On current ext3, each sync flushes all data to disk, so the next 3 syncs + are relatively expensive. + But this could become a performance bottleneck on other filesystems such + as ext4. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Neil Brown points out that this is overzealous, and only one sync is needed: +\end_layout + +\begin_layout Enumerate +Bundle the recovery data, a transaction counter and a strong checksum of + the new data. +\end_layout + +\begin_layout Enumerate +Strong checksum that whole bundle. +\end_layout + +\begin_layout Enumerate +Store the bundle in the database. +\end_layout + +\begin_layout Enumerate +Overwrite the oldest of the two recovery pointers in the header (identified + using the transaction counter) with the offset of this bundle. +\end_layout + +\begin_layout Enumerate +sync. +\end_layout + +\begin_layout Enumerate +Write the new data to the file. +\end_layout + +\begin_layout Standard +Checking for recovery means identifying the latest bundle with a valid checksum + and using the new data checksum to ensure that it has been applied. + This is more expensive than the current check, but need only be done at + open. + For running databases, a separate header field can be used to indicate + a transaction in progress; we need only check for recovery if this is set. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "sub:TDB-Does-Not" + +\end_inset + +TDB Does Not Have Snapshot Support +\end_layout + +\begin_layout Subsubsection +Proposed SolutionNone. + At some point you say +\begin_inset Quotes eld +\end_inset + +use a real database +\begin_inset Quotes erd +\end_inset + + (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). +\end_layout + +\begin_layout Standard +But as a thought experiment, if we implemented transactions to only overwrite + free entries (this is tricky: there must not be a header in each entry + which indicates whether it is free, but use of presence in metadata elsewhere), + and a pointer to the hash table, we could create an entirely new commit + without destroying existing data. + Then it would be easy to implement snapshots in a similar way. +\end_layout + +\begin_layout Standard +This would not allow arbitrary changes to the database, such as tdb_repack + does, and would require more space (since we have to preserve the current + and future entries at once). + If we used hash trees rather than one big hash table, we might only have + to rewrite some sections of the hash, too. +\end_layout + +\begin_layout Standard +We could then implement snapshots using a similar method, using multiple + different hash tables/free tables. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +Transactions Cannot Operate in Parallel +\end_layout + +\begin_layout Standard +This would be useless for ldb, as it hits the index records with just about + every update. + It would add significant complexity in resolving clashes, and cause the + all transaction callers to write their code to loop in the case where the + transactions spuriously failed. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None (but see +\begin_inset CommandInset ref +LatexCommand ref +reference "replay-attribute" + +\end_inset + +). + We could solve a small part of the problem by providing read-only transactions. + These would allow one write transaction to begin, but it could not commit + until all r/o transactions are done. + This would require a new RO_TRANSACTION_LOCK, which would be upgraded on + commit. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\begin_layout Subsection +Default Hash Function Is Suboptimal +\end_layout + +\begin_layout Standard +The Knuth-inspired multiplicative hash used by tdb is fairly slow (especially + if we expand it to 64 bits), and works best when the hash bucket size is + a prime number (which also means a slow modulus). + In addition, it is highly predictable which could potentially lead to a + Denial of Service attack in some TDB uses. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +The Jenkins lookup3 hash +\begin_inset Foot +status open + +\begin_layout Plain Layout +http://burtleburtle.net/bob/c/lookup3.c +\end_layout + +\end_inset + + is a fast and superbly-mixing hash. + It's used by the Linux kernel and almost everything else. + This has the particular properties that it takes an initial seed, and produces + two 32 bit hash numbers, which we can combine into a 64-bit hash. +\end_layout + +\begin_layout Standard +The seed should be created at tdb-creation time from some random source, + and placed in the header. + This is far from foolproof, but adds a little bit of protection against + hash bombing. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +\begin_inset CommandInset label +LatexCommand label +name "Reliable-Traversal-Adds" + +\end_inset + +Reliable Traversal Adds Complexity +\end_layout + +\begin_layout Standard +We lock a record during traversal iteration, and try to grab that lock in + the delete code. + If that grab on delete fails, we simply mark it deleted and continue onwards; + traversal checks for this condition and does the delete when it moves off + the record. +\end_layout + +\begin_layout Standard +If traversal terminates, the dead record may be left indefinitely. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +Remove reliability guarantees; see +\begin_inset CommandInset ref +LatexCommand ref +reference "traverse-Proposed-Solution" + +\end_inset + +. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Complete. +\end_layout + +\begin_layout Subsection +Fcntl Locking Adds Overhead +\end_layout + +\begin_layout Standard +Placing a fcntl lock means a system call, as does removing one. + This is actually one reason why transactions can be faster (everything + is locked once at transaction start). + In the uncontended case, this overhead can theoretically be eliminated. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. +\end_layout + +\begin_layout Standard +We tried this before with spinlock support, in the early days of TDB, and + it didn't make much difference except in manufactured benchmarks. +\end_layout + +\begin_layout Standard +We could use spinlocks (with futex kernel support under Linux), but it means + that we lose automatic cleanup when a process dies with a lock. + There is a method of auto-cleanup under Linux, but it's not supported by + other operating systems. + We could reintroduce a clear-if-first-style lock and sweep for dead futexes + on open, but that wouldn't help the normal case of one concurrent opener + dying. + Increasingly elaborate repair schemes could be considered, but they require + an ABI change (everyone must use them) anyway, so there's no need to do + this at the same time as everything else. +\end_layout + +\begin_layout Subsection +Some Transactions Don't Require Durability +\end_layout + +\begin_layout Standard +Volker points out that gencache uses a CLEAR_IF_FIRST tdb for normal (fast) + usage, and occasionally empties the results into a transactional TDB. + This kind of usage prioritizes performance over durability: as long as + we are consistent, data can be lost. +\end_layout + +\begin_layout Standard +This would be more neatly implemented inside tdb: a +\begin_inset Quotes eld +\end_inset + +soft +\begin_inset Quotes erd +\end_inset + + transaction commit (ie. + syncless) which meant that data may be reverted on a crash. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\end_layout + +\begin_layout Standard +None. +\end_layout + +\begin_layout Standard +Unfortunately any transaction scheme which overwrites old data requires + a sync before that overwrite to avoid the possibility of corruption. +\end_layout + +\begin_layout Standard +It seems possible to use a scheme similar to that described in +\begin_inset CommandInset ref +LatexCommand ref +reference "sub:TDB-Does-Not" + +\end_inset + +,where transactions are committed without overwriting existing data, and + an array of top-level pointers were available in the header. + If the transaction is +\begin_inset Quotes eld +\end_inset + +soft +\begin_inset Quotes erd +\end_inset + + then we would not need a sync at all: existing processes would pick up + the new hash table and free list and work with that. +\end_layout + +\begin_layout Standard +At some later point, a sync would allow recovery of the old data into the + free lists (perhaps when the array of top-level pointers filled). + On crash, tdb_open() would examine the array of top levels, and apply the + transactions until it encountered an invalid checksum. +\end_layout + +\begin_layout Subsection +Tracing Is Fragile, Replay Is External +\end_layout + +\begin_layout Standard +The current TDB has compile-time-enabled tracing code, but it often breaks + as it is not enabled by default. + In a similar way, the ctdb code has an external wrapper which does replay + tracing so it can coordinate cluster-wide transactions. +\end_layout + +\begin_layout Subsubsection +Proposed Solution +\begin_inset CommandInset label +LatexCommand label +name "replay-attribute" + +\end_inset + + +\end_layout + +\begin_layout Standard +Tridge points out that an attribute can be later added to tdb_open (see + +\begin_inset CommandInset ref +LatexCommand ref +reference "attributes" + +\end_inset + +) to provide replay/trace hooks, which could become the basis for this and + future parallel transactions and snapshot support. +\end_layout + +\begin_layout Subsubsection +Status +\end_layout + +\begin_layout Standard +Deferred. +\end_layout + +\end_body +\end_document +@ + + +1.12 +log +@Add status, some fixes, linked freelists. +@ +text +@d53 1 +a53 7 + +\change_deleted 0 1291204535 +14-September +\change_inserted 0 1291204533 +1-December +\change_unchanged +-2010 +a580 2 +\change_inserted 0 1291204563 + +a583 2 + +\change_inserted 0 1291204572 +a587 2 + +\change_inserted 0 1291204573 +a588 2 +\change_unchanged + +a629 2 +\change_inserted 0 1291204588 + +a632 2 + +\change_inserted 0 1291204588 +a636 2 + +\change_inserted 0 1291204631 +a639 2 +\change_unchanged + +a693 2 +\change_inserted 0 1291204639 + +a696 2 + +\change_inserted 0 1291204640 +d702 1 +a702 1 +\change_inserted 0 1291204665 +d704 2 +a728 2 +\change_inserted 0 1291204671 + +a731 2 + +\change_inserted 0 1291204671 +a735 2 + +\change_inserted 0 1291204673 +a736 2 +\change_unchanged + +a780 2 +\change_inserted 0 1291204731 + +a783 2 + +\change_inserted 0 1291204732 +a787 2 + +\change_inserted 0 1291204779 +a790 2 +\change_unchanged + +a842 2 +\change_inserted 0 1291204830 + +a845 2 + +\change_inserted 0 1291204831 +a849 2 + +\change_inserted 0 1291204834 +a850 2 +\change_unchanged + +d879 9 +a887 2 + deal of churn; we are better to guarantee that the tdb_errcode is per-thread + so the current programming model can be maintained. +d891 9 +d903 2 +a922 2 +\change_inserted 0 1291204847 + +a925 2 + +\change_inserted 0 1291204847 +d930 5 +a934 3 + +\change_inserted 0 1291204852 +Incomplete. +a1051 2 +\change_inserted 0 1291204881 + +a1054 2 + +\change_inserted 0 1291204881 +a1058 2 + +\change_inserted 0 1291204885 +a1059 2 +\change_unchanged + +a1140 2 +\change_inserted 0 1291204898 + +a1143 2 + +\change_inserted 0 1291204898 +a1147 2 + +\change_inserted 0 1291204901 +a1148 2 +\change_unchanged + +a1224 2 +\change_inserted 0 1291204908 + +a1227 2 + +\change_inserted 0 1291204908 +a1231 2 + +\change_inserted 0 1291204908 +a1232 2 +\change_unchanged + +a1271 2 +\change_inserted 0 1291204917 + +a1274 2 + +\change_inserted 0 1291204917 +a1278 2 + +\change_inserted 0 1291204920 +a1279 2 +\change_unchanged + +a1316 2 +\change_inserted 0 1291204927 + +a1319 2 + +\change_inserted 0 1291204928 +d1325 1 +a1325 1 +\change_inserted 0 1291204942 +d1327 2 +a1381 2 +\change_inserted 0 1291205003 + +a1384 2 + +\change_inserted 0 1291205004 +a1388 2 + +\change_inserted 0 1291205007 +a1411 2 +\change_inserted 0 1291205019 + +a1414 2 + +\change_inserted 0 1291205019 +a1418 2 + +\change_inserted 0 1291205023 +a1419 2 +\change_unchanged + +a1465 2 +\change_inserted 0 1291205029 + +a1468 2 + +\change_inserted 0 1291205029 +a1472 2 + +\change_inserted 0 1291206020 +a1473 2 +\change_unchanged + +a1528 2 +\change_inserted 0 1291205043 + +a1531 2 + +\change_inserted 0 1291205043 +d1537 1 +a1537 1 +\change_inserted 0 1291205057 +d1539 2 +a1589 2 +\change_inserted 0 1291205062 + +a1592 2 + +\change_inserted 0 1291205062 +a1596 2 + +\change_inserted 0 1291205062 +a1597 2 +\change_unchanged + +a1626 2 +\change_inserted 0 1291205072 + +a1629 2 + +\change_inserted 0 1291205073 +a1633 2 + +\change_inserted 0 1291205073 +a1634 2 +\change_unchanged + +a1674 4 + +\change_deleted 0 1291204504 + +\change_unchanged +a1699 2 +\change_inserted 0 1291205079 + +a1702 2 + +\change_inserted 0 1291205080 +a1706 2 + +\change_inserted 0 1291205080 +a1707 2 +\change_unchanged + +a1833 2 +\change_inserted 0 1291205090 + +d1869 2 +a1870 7 + is to divide the file into zones, and using a free list (or +\change_inserted 0 1291205498 +table +\change_deleted 0 1291205497 +set +\change_unchanged + of free lists) for each. +a1871 2 +\change_inserted 0 1291205203 + +a1874 2 + +\change_inserted 0 1291205358 +a1890 21 +\change_unchanged + +\end_layout + +\begin_layout Standard + +\change_deleted 0 1291205198 +Note that this means we need to split the free lists when we expand the + file; this is probably acceptable when we double the hash table size, since + that is such an expensive operation already. + In the case of increasing the file size, there is an optimization we can + use: if we use M in the formula above as the file size rounded up to the + next power of 2, we only need reshuffle free lists when the file size crosses + a power of 2 boundary, +\emph on +and +\emph default +reshuffling the free lists is trivial: we simply merge every consecutive + pair of free lists. +\change_unchanged + +d1899 1 +a1899 7 +Identify the correct +\change_inserted 0 1291205366 +free list +\change_deleted 0 1291205364 +zone +\change_unchanged +. +d1907 2 +a1908 7 +Re-check the +\change_inserted 0 1291205372 +list +\change_deleted 0 1291205371 +zone +\change_unchanged + (we didn't have a lock, sizes could have changed): relock if necessary. +d1912 1 +a1912 5 +Place the freed entry in the list +\change_deleted 0 1291205382 + for that zone +\change_unchanged +. +d1921 1 +a1921 15 +Pick a +\change_deleted 0 1291205403 +zone either the zone we last freed into, or based on a +\begin_inset Quotes eld +\end_inset + +random +\begin_inset Quotes erd +\end_inset + + number. +\change_inserted 0 1291205411 +free table; usually the previous one. +\change_unchanged + +a1925 10 +\change_deleted 0 1291205432 + +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1291205428 +Re-check the zone: relock if necessary. +\change_unchanged + +d1934 1 +a1934 7 + unlock the list and try the next +\change_inserted 0 1291205455 +largest list +\change_deleted 0 1291205452 +zone. +\change_inserted 0 1291205457 + +a1937 2 + +\change_inserted 0 1291205476 +a1938 2 +\change_unchanged + +a1966 2 +\change_inserted 0 1291205542 + +a1969 2 + +\change_inserted 0 1291205591 +a1971 70 +\change_unchanged + +\end_layout + +\begin_layout Standard + +\change_deleted 0 1291205539 +I anticipate that the number of entries in each free zone would be small, + but it might be worth using one free entry to hold pointers to the others + for cache efficiency. +\change_unchanged + +\end_layout + +\begin_layout Standard + +\change_deleted 0 1291205534 +\begin_inset CommandInset label +LatexCommand label +name "freelist-in-zone" + +\end_inset + +If we want to avoid locking complexity (enlarging the free lists when we + enlarge the file) we could place the array of free lists at the beginning + of each zone. + This means existing array lists never move, but means that a record cannot + be larger than a zone. + That in turn implies that zones should be variable sized (say, power of + 2), which makes the question +\begin_inset Quotes eld +\end_inset + +what zone is this record in? +\begin_inset Quotes erd +\end_inset + + much harder (and +\begin_inset Quotes eld +\end_inset + +pick a random zone +\begin_inset Quotes erd +\end_inset + +, but that's less common). + It could be done with as few as 4 bits from the record header. +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout +Using +\begin_inset Formula $2^{16+N*3}$ +\end_inset + +means 0 gives a minimal 65536-byte zone, 15 gives the maximal +\begin_inset Formula $2^{61}$ +\end_inset + + byte zone. + Zones range in factor of 8 steps. + Given the zone size for the zone the current record is in, we can determine + the start of the zone. +\end_layout + +\end_inset + + +\change_inserted 0 1291205139 + +d2218 1 +a2218 5 + uint32_t +\change_inserted 0 1291205758 +used_ +\change_unchanged +magic : 16, +a2222 4 +\change_deleted 0 1291205693 + prev_is_free: 1, +\change_unchanged + +d2230 1 +a2230 7 + top_hash: 1 +\change_inserted 0 1291205704 +1 +\change_deleted 0 1291205704 +0 +\change_unchanged +; +d2254 1 +a2254 9 + uint +\change_inserted 0 1291205725 +64 +\change_deleted 0 1291205723 +32 +\change_unchanged +_t +\change_inserted 0 1291205753 +free_magic: 8, +a2257 2 + +\change_inserted 0 1291205746 +a2262 24 +\change_deleted 0 1291205749 +free_magic; +\change_unchanged + +\end_layout + +\begin_layout LyX-Code + uint64_t +\change_inserted 0 1291205786 +free_table: 8, +\end_layout + +\begin_layout LyX-Code + +\change_inserted 0 1291205788 + +\change_unchanged +total_length +\change_inserted 0 1291205792 + : 56 +\change_deleted 0 1291205790 +; +\change_unchanged + +d2266 1 +a2266 7 + uint64_t +\change_deleted 0 1291205801 +prev, +\change_unchanged +next; +\change_deleted 0 1291205811 + +d2270 1 +a2270 3 + +\change_deleted 0 1291205811 + ... +d2274 1 +a2274 5 + +\change_deleted 0 1291205808 + uint64_t tailer +\change_unchanged +; +d2283 5 +a2287 16 +\change_deleted 0 1291205827 +We might want to take some bits from the used record's top_hash (and the + free record which has 32 bits of padding to spare anyway) if we use variable + sized zones. + See +\begin_inset CommandInset ref +LatexCommand ref +reference "freelist-in-zone" + +\end_inset + +. + +\change_inserted 0 1291205885 + Note that by limiting valid offsets to 56 bits, we can pack everything + we need into 3 64-byte words, meaning our minimum record size is 8 bytes. +a2290 2 + +\change_inserted 0 1291205886 +a2294 2 + +\change_inserted 0 1291205886 +a2295 2 +\change_unchanged + +a2385 2 +\change_inserted 0 1291205894 + +a2388 2 + +\change_inserted 0 1291205894 +a2392 2 + +\change_inserted 0 1291205902 +a2393 2 +\change_unchanged + +a2415 4 + +\change_deleted 0 1291204504 + +\change_unchanged +a2445 2 +\change_inserted 0 1291205910 + +a2448 2 + +\change_inserted 0 1291205910 +a2452 2 + +\change_inserted 0 1291205914 +a2453 2 +\change_unchanged + +a2485 2 +\change_inserted 0 1291205919 + +a2488 2 + +\change_inserted 0 1291205919 +a2492 2 + +\change_inserted 0 1291205922 +a2493 2 +\change_unchanged + +a2533 2 +\change_inserted 0 1291205929 + +a2536 2 + +\change_inserted 0 1291205929 +a2540 2 + +\change_inserted 0 1291205929 +a2541 2 +\change_unchanged + +a2578 2 +\change_inserted 0 1291205932 + +a2581 2 + +\change_inserted 0 1291205933 +a2585 2 + +\change_inserted 0 1291205933 +a2586 2 +\change_unchanged + +a2724 2 +\change_inserted 0 1291205944 + +a2727 2 + +\change_inserted 0 1291205945 +a2731 2 + +\change_inserted 0 1291205948 +a2732 2 +\change_unchanged + +@ + + +1.11 +log +@Merge changes +@ +text +@d53 7 +a59 1 +14-September-2010 +d587 16 +d644 18 +d716 16 +d753 16 +d813 18 +d883 16 +d953 16 +d1084 16 +d1181 16 +d1273 16 +d1328 16 +d1381 16 +d1447 19 +a1465 2 + if older code (which doesn't understand the feature) writes to the database.Reco +rd Headers Are Not Expandible +d1484 16 +d1546 16 +d1617 16 +d1680 16 +d1725 16 +d1810 16 +d1951 8 +a1958 3 +Proposed SolutionThe first step is to remove all the current heuristics, + as they obviously interact, then examine them once the lock contention + is addressed. +d1989 7 +a1995 2 + is to divide the file into zones, and using a free list (or set of free + lists) for each. +d1997 2 +d2002 25 +d2039 2 +d2049 7 +a2055 1 +Identify the correct zone. +d2063 7 +a2069 2 +Re-check the zone (we didn't have a lock, sizes could have changed): relock + if necessary. +d2073 5 +a2077 1 +Place the freed entry in the list for that zone. +d2086 3 +a2088 1 +Pick a zone either the zone we last freed into, or based on a +d2097 4 +d2105 2 +d2110 2 +d2113 2 +d2123 15 +a2137 1 + unlock the list and try the next zone. +d2166 11 +d2180 2 +d2185 2 +d2190 2 +d2223 1 +a2223 1 +status open +d2243 2 +d2491 5 +a2495 1 + uint32_t magic : 16, +d2499 2 +d2502 2 +d2511 7 +a2517 1 + top_hash: 10; +d2541 29 +a2569 1 + uint32_t free_magic; +d2573 11 +a2583 1 + uint64_t total_length; +d2587 7 +a2593 1 + uint64_t prev, next; +d2597 2 +d2603 5 +a2607 1 + uint64_t tailer; +d2615 2 +d2628 18 +d2736 16 +d2808 16 +d2856 16 +d2912 16 +d2965 16 +d3119 16 +@ + + +1.10 +log +@Tracing attribute, talloc support. +@ +text +@d1 1 +a1 1 +#LyX 1.6.5 created this file. For more info see http://www.lyx.org/ +d53 1 +a53 7 + +\change_deleted 0 1283307542 +26-July +\change_inserted 0 1284423485 +14-September +\change_unchanged +-2010 +a472 2 +\change_inserted 0 1284422789 + +a479 2 +\change_unchanged + +a838 2 + +\change_inserted 0 1284016998 +a846 2 +\change_unchanged + +a1194 2 +\change_inserted 0 1284015637 + +a1197 2 + +\change_inserted 0 1284015716 +a1201 2 + +\change_inserted 0 1284015906 +a1210 2 + +\change_inserted 0 1284015637 +a1214 2 + +\change_inserted 0 1284016114 +a1227 2 + +\change_inserted 0 1284016149 +a1232 2 + +\change_inserted 0 1284016639 +a1237 2 + +\change_inserted 0 1284016821 +a1243 2 + +\change_inserted 0 1284016803 +d1245 2 +a1246 9 + if older code (which doesn't understand the feature) writes to the database. +\change_deleted 0 1284016101 + +\end_layout + +\begin_layout Subsection + +\change_inserted 0 1284015634 +Record Headers Are Not Expandible +a1249 2 + +\change_inserted 0 1284015634 +a1254 2 + +\change_inserted 0 1284015634 +a1258 2 + +\change_inserted 0 1284422552 +a1267 2 + +\change_inserted 0 1284422568 +a1271 2 + +\change_inserted 0 1284422646 +a1276 2 + +\change_inserted 0 1284422656 +a1280 2 + +\change_inserted 0 1284423065 +a1305 2 + +\change_inserted 0 1284423042 +a1310 2 +\change_unchanged + +a1457 2 + +\change_inserted 0 1283336713 +a1463 2 + +\change_unchanged +d1482 2 +d1485 1 +a1485 51 +\change_deleted 0 1283307675 +There are three details which become important: +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1283307675 +On encountering a full bucket, we use the next bucket. +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1283307675 +Extra hash bits are stored with the offset, to reduce comparisons. +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1283307675 +A marker entry is used on deleting an entry. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1283307675 +The doubling of the table must be done under a transaction; we will not + reduce it on deletion, so it will be an unusual case. + It will either be placed at the head (other entries will be moved out the + way so we can expand). + We could have a pointer in the header to the current hashtable location, + but that pointer would have to be read frequently to check for hashtable + moves. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1283307675 +The locking for this is slightly more complex than the chained case; we + currently have one lock per bucket, and that means we would need to expand + the lock if we overflow to the next bucket. + The frequency of such collisions will effect our locking heuristics: we + can always lock more buckets than we need. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1283307675 +One possible optimization is to only re-check the hash size on an insert + or a lookup miss. + +\change_inserted 0 1283307770 +a1492 2 + +\change_inserted 0 1283336187 +a1500 2 + +\change_inserted 0 1283336586 +a1510 2 +\change_unchanged + +d1636 3 +a1638 8 +Proposed Solution +\change_deleted 0 1283336858 + +\end_layout + +\begin_layout Standard +The first step is to remove all the current heuristics, as they obviously + interact, then examine them once the lock contention is addressed. +a1647 2 +\change_inserted 0 1283336910 + +a1650 2 + +\change_inserted 0 1283337052 +a1655 2 +\change_unchanged + +a1776 2 +\change_inserted 0 1283309850 + +a1779 2 + +\change_inserted 0 1283337216 +a1813 2 + +\change_inserted 0 1284424151 +a1825 2 +\change_unchanged + +a1830 2 +\change_unchanged + +a2031 2 + +\change_inserted 0 1283336739 +a2040 2 +\change_unchanged + +a2117 2 +\change_inserted 0 1283337133 + +a2120 2 + +\change_inserted 0 1283337139 +a2121 2 +\change_unchanged + +a2136 2 + +\change_inserted 0 1283337235 +a2147 2 +\change_unchanged + +d2251 1 +a2251 7 +Proposed Solution +\change_deleted 0 1284423472 + +\end_layout + +\begin_layout Standard +None. +d2261 1 +a2261 1 +\change_inserted 0 1284423891 +d2263 1 +a2263 4 +\change_deleted 0 1284423891 +. + +\change_inserted 0 1284423901 +a2271 2 +\change_unchanged + +a2293 2 +\change_inserted 0 1284423495 + +a2312 2 + +\change_inserted 0 1284424201 +d2321 1 +a2321 3 + +\change_unchanged +We could solve a small part of the problem by providing read-only transactions. +a2505 2 +\change_inserted 0 1284423555 + +a2508 2 + +\change_inserted 0 1284423617 +a2512 2 + +\change_inserted 0 1284423719 +a2519 2 + +\change_inserted 0 1284423864 +a2530 2 + +\change_inserted 0 1284423850 +a2540 2 +\change_unchanged + +@ + + +1.9 +log +@Extension mechanism. +@ +text +@d56 2 +a57 2 +\change_inserted 0 1284016854 +9-September +d479 11 +d1303 1 +a1303 1 +\change_inserted 0 1284016847 +d1310 56 +d1945 1 +a1945 1 +\change_inserted 0 1283310945 +d1956 2 +d2402 2 +d2416 4 +d2421 12 +d2455 2 +d2476 12 +d2673 47 +@ + + +1.8 +log +@Remove bogus footnote +@ +text +@d56 2 +a57 2 +\change_inserted 0 1283307544 +1-September +d838 12 +d1198 103 +@ + + +1.7 +log +@Moving hash table does not work. +@ +text +@a1436 12 +\begin_inset Foot +status collapsed + +\begin_layout Plain Layout + +\change_inserted 0 1283336450 +If we make the hash offsets zone-relative, then this only restricts the + zone size, not the overall database size. +\end_layout + +\end_inset + +@ + + +1.6 +log +@Commit changes +@ +text +@d38 1 +a38 1 +\author "" +d53 7 +a59 1 +26-July-2010 +d1333 10 +d1361 3 +a1363 1 + There are three details which become important: +d1367 2 +d1373 2 +d1379 2 +d1385 2 +d1397 2 +d1407 2 +d1411 45 +d1582 2 +d1598 14 +d1733 62 +d1996 13 +d2086 10 +d2110 15 +a2124 1 +\begin_layout LyX-Code +@ + + +1.5 +log +@Soft transaction commit +@ +text +@d38 1 +a38 1 +\author "Rusty Russell,,," +a52 4 + +\change_deleted 0 1280141199 +10-May-2010 +\change_inserted 0 1280141202 +a53 2 +\change_unchanged + +a2028 2 + +\change_inserted 0 1280140902 +a2034 2 + +\change_unchanged +a2212 2 +\change_inserted 0 1280140661 + +a2215 2 + +\change_inserted 0 1280140703 +a2219 2 + +\change_inserted 0 1280708312 +a2226 2 + +\change_inserted 0 1280708400 +a2239 2 + +\change_inserted 0 1280140836 +a2243 2 + +\change_inserted 0 1280708255 +a2247 2 + +\change_inserted 0 1280708374 +a2252 2 + +\change_inserted 0 1280141181 +a2274 2 + +\change_inserted 0 1280141345 +@ + + +1.4 +log +@Merge changes +@ +text +@d38 1 +a38 1 +\author "" +d53 2 +d56 4 +d2035 10 +d2223 84 +@ + + +1.3 +log +@Transaction and freelist rethink. +@ +text +@d38 1 +a38 1 +\author "Rusty Russell,,," +d53 1 +a53 1 +27-April-2010 +d662 1 +a662 5 + behavior of disallowing +\change_inserted 0 1272940179 +nested +\change_unchanged +transactions should become the default. +a1210 2 +\change_inserted 0 1272944650 + +a1214 2 + +\change_inserted 0 1272944763 +a1218 2 +\change_unchanged + +a1223 2 +\change_unchanged + +a1301 2 + +\change_inserted 0 1273478114 +a1310 2 +\change_unchanged + +d1515 1 +a1515 11 +The free list +\change_deleted 0 1273469807 +should +\change_inserted 0 1273469810 +must +\change_unchanged + be split +\change_deleted 0 1273469815 +into multiple lists +\change_unchanged +to reduce contention. +a1520 2 +\change_inserted 0 1273470006 + +a1523 2 + +\change_inserted 0 1273492055 +a1539 2 + +\change_inserted 0 1273483888 +a1551 2 +\change_unchanged + +a1554 8 + +\change_deleted 0 1272942055 +There are various ways to organize these lisys, but because we want to be + able to quickly identify which free list an entry is in, and reduce the + number of locks required for merging, we will use zoning (eg. + each free list covers some fixed fraction of the file). + +\change_inserted 0 1273484187 +d1556 1 +a1556 7 + +\change_deleted 0 1273484194 +The algorithm for f +\change_inserted 0 1273484194 +F +\change_unchanged +reeing is simple: +d1560 1 +a1560 7 +Identify the correct +\change_deleted 0 1273482856 +free list +\change_inserted 0 1273482857 +zone +\change_unchanged +. +d1564 1 +a1564 7 +Lock the +\change_inserted 0 1273482895 +corresponding +\change_unchanged +list +\change_inserted 0 1273482863 +. +a1567 2 + +\change_inserted 0 1273482909 +d1573 1 +a1573 13 + +\change_deleted 0 1273482885 +, and p +\change_inserted 0 1273482888 +P +\change_unchanged +lace the freed entry +\change_deleted 0 1273492415 +at the head +\change_inserted 0 1273492415 +in the list for that zone +\change_unchanged +. +d1577 2 +a1578 7 +Allocation is a little more complicated, as we +\change_deleted 0 1273483240 +merge entries as we walk the list: +\change_inserted 0 1273484250 +perform delayed coalescing at this point: +\change_unchanged + +d1582 1 +a1582 19 +Pick a +\change_deleted 0 1273482955 +free list; +\change_inserted 0 1273482957 +zone +\change_unchanged + either the +\change_deleted 0 1273482962 +list +\change_inserted 0 1273482962 +zone +\change_unchanged + we last freed +\change_deleted 0 1273482966 +o +\change_inserted 0 1273482966 +i +\change_unchanged +nto, or based on a +d1594 1 +a1594 9 +Lock th +\change_inserted 0 1273482980 +e corresponding +\change_deleted 0 1273482973 +at +\change_unchanged + list. +\change_inserted 0 1273482982 + +a1597 2 + +\change_inserted 0 1273483084 +a1598 53 +\change_unchanged + +\end_layout + +\begin_layout Enumerate +If the top entry is +\change_deleted 0 1273492155 +well-sized, +\change_inserted 0 1273492159 +-large enough, +\change_unchanged +remove it from the list and return it. +\end_layout + +\begin_layout Enumerate +Otherwise, +\change_inserted 0 1273492206 +coalesce entries in the list. +\change_deleted 0 1273492200 +examine the entry to the right of it in the file. + If it is free: +\end_layout + +\begin_deeper +\begin_layout Enumerate + +\change_deleted 0 1273492200 +If that entry is in a different list, lock that list too. +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1273492200 +If we had to place a new lock, re-check that the entry is free. +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1273492200 +Remove that entry from its free list and expand this entry to cover it. +\end_layout + +\begin_layout Enumerate + +\change_deleted 0 1273485554 +Goto step 3. +\end_layout + +\end_deeper +\begin_layout Enumerate + +\change_inserted 0 1273485311 +If there was no entry large enough, unlock the list and try the next zone. +d1602 1 +a1602 5 + +\change_deleted 0 1273483646 +Repeat step 3 with each entry in the list. +\change_unchanged + +d1606 2 +a1607 5 + +\change_deleted 0 1273483668 +Unlock the list and repeat step 2 with the next list. +\change_unchanged + +d1611 1 +a1611 7 +If no +\change_deleted 0 1273483671 +list +\change_inserted 0 1273483671 +zone +\change_unchanged + satisfies, expand the file. +d1615 2 +a1616 9 +This optimizes rapid insert/delete of free list entries +\change_inserted 0 1273485794 + by not coalescing them all the time. +\change_deleted 0 1273483685 +, and allows us to get rid of the tailer altogether +\change_unchanged +. + +\change_inserted 0 1273492299 +a1638 39 + +\change_deleted 0 1273476840 +The question of +\begin_inset Quotes eld +\end_inset + +well-sized +\begin_inset Quotes erd +\end_inset + + free entries is more difficult: the 25% overhead works in practice for + ldb because indexes tend to expand by one record at a time. + This can be resolved by having an +\begin_inset Quotes eld +\end_inset + +expanded +\begin_inset Quotes erd +\end_inset + + bit in the header to note entries that have previously expanded, and allocating + more space for them. + Whether the +\begin_inset Quotes eld +\end_inset + +increasing slack +\begin_inset Quotes erd +\end_inset + + algorithm should be implemented or first-fit used is still unknown: we + will determine this once these other ideas are implemented. +\change_inserted 0 1273483750 + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1273492450 +a1644 2 + +\change_inserted 0 1273470441 +a1654 2 + +\change_inserted 0 1273476556 +a1659 2 + +\change_inserted 0 1273470423 +a1661 2 +\change_unchanged + +a1672 2 + +\change_inserted 0 1273476847 +a1676 2 + +\change_inserted 0 1273476886 +a1691 2 + +\change_inserted 0 1273477233 +a1699 2 + +\change_inserted 0 1273477534 +a1706 2 + +\change_inserted 0 1273482700 +a1712 2 + +\change_inserted 0 1273478079 +a1722 2 + +\change_inserted 0 1273477839 +a1726 2 + +\change_inserted 0 1273477925 +a1730 2 + +\change_inserted 0 1273477925 +a1734 2 + +\change_inserted 0 1273477925 +a1738 2 + +\change_inserted 0 1273477925 +a1742 2 + +\change_inserted 0 1273477925 +a1746 2 + +\change_inserted 0 1273477925 +a1750 2 + +\change_inserted 0 1273477925 +a1754 2 + +\change_inserted 0 1273477925 +a1758 2 + +\change_inserted 0 1273477925 +a1762 2 + +\change_inserted 0 1273477925 +a1766 2 + +\change_inserted 0 1273477925 +a1770 2 + +\change_inserted 0 1273477925 +a1774 2 + +\change_inserted 0 1273477925 +a1778 2 + +\change_inserted 0 1273477925 +a1782 2 + +\change_inserted 0 1273477925 +a1786 2 + +\change_inserted 0 1273477925 +a1790 2 + +\change_inserted 0 1273477925 +a1794 2 + +\change_inserted 0 1273477925 +a1798 2 + +\change_inserted 0 1273492522 +a1802 2 + +\change_inserted 0 1273492530 +a1806 2 + +\change_inserted 0 1273492546 +a1810 2 + +\change_inserted 0 1273478239 +a1814 2 + +\change_inserted 0 1273479960 +a1821 2 + +\change_inserted 0 1273480265 +a1830 2 + +\change_inserted 0 1273480354 +a1845 2 + +\change_inserted 0 1273478968 +a1851 2 + +\change_inserted 0 1273492604 +a1859 2 + +\change_inserted 0 1273479572 +a1862 2 +\change_unchanged + +a1870 2 + +\change_inserted 0 1273480282 +a1874 2 + +\change_inserted 0 1273478931 +a1878 2 + +\change_inserted 0 1273481549 +a1882 2 + +\change_inserted 0 1273481557 +a1886 2 + +\change_inserted 0 1273480307 +a1890 2 + +\change_inserted 0 1273480335 +a1894 2 + +\change_inserted 0 1273479897 +a1898 2 + +\change_inserted 0 1273479653 +a1902 2 + +\change_inserted 0 1273480371 +a1906 2 + +\change_inserted 0 1273480464 +a1910 2 + +\change_inserted 0 1273480399 +a1914 2 + +\change_inserted 0 1273480425 +a1918 2 + +\change_inserted 0 1273480453 +a1922 2 + +\change_inserted 0 1273480455 +a1926 2 + +\change_inserted 0 1273480450 +a1930 2 + +\change_inserted 0 1273480452 +a1935 2 +\change_inserted 0 1273478830 + +a1942 5 + +\change_deleted 0 1273481604 +In theory, we could get away with 2: one after we write the new data, and + one to somehow atomically change over to it. +\change_inserted 0 1273481632 +a1946 2 + +\change_inserted 0 1273481724 +a1950 2 + +\change_inserted 0 1273481713 +a1954 2 + +\change_inserted 0 1273481717 +a1958 2 + +\change_inserted 0 1273481730 +a1962 2 + +\change_inserted 0 1273481736 +a1966 2 + +\change_inserted 0 1273481744 +a1970 2 + +\change_inserted 0 1273481748 +a1974 2 + +\change_inserted 0 1273482185 +a1978 2 + +\change_inserted 0 1273482259 +a1989 50 + +\change_deleted 0 1273481848 +None. + Trying to rewrite the transaction code is a separate experiment, which + I encourage someone else to do. + At some point you say +\begin_inset Quotes eld +\end_inset + +use a real database +\begin_inset Quotes erd +\end_inset + +. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1273481848 +But as a thought experiment: +\change_unchanged + +\end_layout + +\begin_layout Standard + +\change_deleted 0 1273481788 +Say there was a pointer in the header which said where the hash table and + free list tables were, and that no blocks were labeled with whether they + were free or not (it had to be derived from what list they were in). + We could create new hash table and free list in some free space, and populate + it as we want the post-committed state to look. + Then we sync, then we switch the offset in the header, then we sync again. +\end_layout + +\begin_layout Standard + +\change_deleted 0 1273481788 +This would not allow arbitrary changes to the database, such as tdb_repack + does, and would require more space (since we have to preserve the current + and future entries at once). + If we used hash trees rather than one big hash table, we might only have + to rewrite some sections of the hash, too. +\change_inserted 0 1273481854 + +\end_layout + +\begin_layout Standard + +\change_inserted 0 1273482102 +a1993 2 + +\change_inserted 0 1273482061 +a1998 2 + +\change_inserted 0 1273482063 +a2002 2 + +\change_inserted 0 1273482072 +a2006 2 + +\change_inserted 0 1273482139 +a2011 2 + +\change_inserted 0 1273482364 +a2015 2 + +\change_inserted 0 1273482163 +a2019 2 + +\change_inserted 0 1273482493 +a2037 2 + +\change_inserted 0 1273482536 +a2046 2 +\change_unchanged + +a2049 2 + +\change_inserted 0 1273482641 +a2058 2 + +\change_inserted 0 1273481827 +d2067 2 +a2068 11 +We could +\change_inserted 0 1273481829 +then +\change_unchanged +implement snapshots using a similar method +\change_deleted 0 1273481838 + to the above, only +\change_inserted 0 1273481840 +, +\change_unchanged + using multiple different hash tables/free tables. +@ + + +1.2 +log +@After first feedback (Ronnie & Volker) +@ +text +@d1314 13 +d1531 11 +a1541 1 +The free list should be split into multiple lists to reduce contention. +d1547 39 +d1596 7 +d1604 1 +a1604 1 +The algorithm for freeing is simple: +d1608 7 +a1614 1 +Identify the correct free list. +d1618 30 +a1647 1 +Lock the list, and place the freed entry at the head. +d1651 7 +a1657 2 +Allocation is a little more complicated, as we merge entries as we walk + the list: +d1661 19 +a1679 1 +Pick a free list; either the list we last freed onto, or based on a +d1691 17 +a1707 1 +Lock that list. +d1711 7 +a1717 1 +If the top entry is well-sized, remove it from the list and return it. +d1721 5 +a1725 1 +Otherwise, examine the entry to the right of it in the file. +d1731 2 +d1737 2 +d1743 2 +d1749 2 +d1756 8 +d1765 2 +d1770 2 +d1773 2 +d1778 7 +a1784 1 +If no list satisfies, expand the file. +d1788 28 +a1815 2 +This optimizes rapid insert/delete of free list entries, and allows us to + get rid of the tailer altogether. +d1819 2 +d1851 1 +a1851 1 +\change_inserted 0 1272941474 +d1857 303 +a2159 18 +\change_inserted 0 1272942759 +There are various ways to organize these lists, but because we want to be + able to quickly identify which free list an entry is in, and reduce the + number of locks required for merging, we will use zoning (eg. + each of the N free lists in a tdb file of size M covers a fixed fraction + M/N). + Note that this means we need to reshuffle the free lists when we expand + the file; this is probably acceptable when we double the hash table size, + since that is such an expensive operation already. + In the case of increasing the file size, there is an optimization we can + use: if we use M in the formula above as the file size rounded up to the + next power of 2, we only need reshuffle free lists when the file size crosses + a power of 2 boundary, +\emph on +and +\emph default +reshuffling the free lists is trivial: we simply merge every consecutive + pair of free lists. +d2164 107 +d2276 2 +d2280 59 +d2346 2 +d2363 2 +d2366 2 +d2371 2 +d2382 2 +d2389 57 +d2458 13 +d2474 32 +a2505 2 +We could implement snapshots using a similar method to the above, only using + multiple different hash tables/free tables. +@ + + +1.1 +log +@Initial revision +@ +text +@d1 1 +a1 1 +#LyX 1.6.4 created this file. For more info see http://www.lyx.org/ +d36 3 +a38 3 +\tracking_changes false +\output_changes false +\author "" +d662 5 +a666 1 + behavior of disallowing transactions should become the default. +d1215 21 +d1527 2 +d1533 3 +a1535 1 + The algorithm for freeing is simple: +d1642 26 +@ diff --git a/lib/tdb2/doc/design.pdf b/lib/tdb2/doc/design.pdf Binary files differnew file mode 100644 index 0000000000..558dc1f8c2 --- /dev/null +++ b/lib/tdb2/doc/design.pdf diff --git a/lib/tdb2/doc/design.txt b/lib/tdb2/doc/design.txt new file mode 100644 index 0000000000..bd2ffde4db --- /dev/null +++ b/lib/tdb2/doc/design.txt @@ -0,0 +1,1258 @@ +TDB2: A Redesigning The Trivial DataBase + +Rusty Russell, IBM Corporation + +1-December-2010 + +Abstract + +The Trivial DataBase on-disk format is 32 bits; with usage cases +heading towards the 4G limit, that must change. This required +breakage provides an opportunity to revisit TDB's other design +decisions and reassess them. + +1 Introduction + +The Trivial DataBase was originally written by Andrew Tridgell as +a simple key/data pair storage system with the same API as dbm, +but allowing multiple readers and writers while being small +enough (< 1000 lines of C) to include in SAMBA. The simple design +created in 1999 has proven surprisingly robust and performant, +used in Samba versions 3 and 4 as well as numerous other +projects. Its useful life was greatly increased by the +(backwards-compatible!) addition of transaction support in 2005. + +The wider variety and greater demands of TDB-using code has lead +to some organic growth of the API, as well as some compromises on +the implementation. None of these, by themselves, are seen as +show-stoppers, but the cumulative effect is to a loss of elegance +over the initial, simple TDB implementation. Here is a table of +the approximate number of lines of implementation code and number +of API functions at the end of each year: + + ++-----------+----------------+--------------------------------+ +| Year End | API Functions | Lines of C Code Implementation | ++-----------+----------------+--------------------------------+ ++-----------+----------------+--------------------------------+ +| 1999 | 13 | 1195 | ++-----------+----------------+--------------------------------+ +| 2000 | 24 | 1725 | ++-----------+----------------+--------------------------------+ +| 2001 | 32 | 2228 | ++-----------+----------------+--------------------------------+ +| 2002 | 35 | 2481 | ++-----------+----------------+--------------------------------+ +| 2003 | 35 | 2552 | ++-----------+----------------+--------------------------------+ +| 2004 | 40 | 2584 | ++-----------+----------------+--------------------------------+ +| 2005 | 38 | 2647 | ++-----------+----------------+--------------------------------+ +| 2006 | 52 | 3754 | ++-----------+----------------+--------------------------------+ +| 2007 | 66 | 4398 | ++-----------+----------------+--------------------------------+ +| 2008 | 71 | 4768 | ++-----------+----------------+--------------------------------+ +| 2009 | 73 | 5715 | ++-----------+----------------+--------------------------------+ + + +This review is an attempt to catalog and address all the known +issues with TDB and create solutions which address the problems +without significantly increasing complexity; all involved are far +too aware of the dangers of second system syndrome in rewriting a +successful project like this. + +2 API Issues + +2.1 tdb_open_ex Is Not Expandable + +The tdb_open() call was expanded to tdb_open_ex(), which added an +optional hashing function and an optional logging function +argument. Additional arguments to open would require the +introduction of a tdb_open_ex2 call etc. + +2.1.1 Proposed Solution<attributes> + +tdb_open() will take a linked-list of attributes: + +enum tdb_attribute { + + TDB_ATTRIBUTE_LOG = 0, + + TDB_ATTRIBUTE_HASH = 1 + +}; + +struct tdb_attribute_base { + + enum tdb_attribute attr; + + union tdb_attribute *next; + +}; + +struct tdb_attribute_log { + + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_LOG +*/ + + tdb_log_func log_fn; + + void *log_private; + +}; + +struct tdb_attribute_hash { + + struct tdb_attribute_base base; /* .attr = TDB_ATTRIBUTE_HASH +*/ + + tdb_hash_func hash_fn; + + void *hash_private; + +}; + +union tdb_attribute { + + struct tdb_attribute_base base; + + struct tdb_attribute_log log; + + struct tdb_attribute_hash hash; + +}; + +This allows future attributes to be added, even if this expands +the size of the union. + +2.1.2 Status + +Complete. + +2.2 tdb_traverse Makes Impossible Guarantees + +tdb_traverse (and tdb_firstkey/tdb_nextkey) predate transactions, +and it was thought that it was important to guarantee that all +records which exist at the start and end of the traversal would +be included, and no record would be included twice. + +This adds complexity (see[Reliable-Traversal-Adds]) and does not +work anyway for records which are altered (in particular, those +which are expanded may be effectively deleted and re-added behind +the traversal). + +2.2.1 <traverse-Proposed-Solution>Proposed Solution + +Abandon the guarantee. You will see every record if no changes +occur during your traversal, otherwise you will see some subset. +You can prevent changes by using a transaction or the locking +API. + +2.2.2 Status + +Complete. Delete-during-traverse will still delete every record, +too (assuming no other changes). + +2.3 Nesting of Transactions Is Fraught + +TDB has alternated between allowing nested transactions and not +allowing them. Various paths in the Samba codebase assume that +transactions will nest, and in a sense they can: the operation is +only committed to disk when the outer transaction is committed. +There are two problems, however: + +1. Canceling the inner transaction will cause the outer + transaction commit to fail, and will not undo any operations + since the inner transaction began. This problem is soluble with + some additional internal code. + +2. An inner transaction commit can be cancelled by the outer + transaction. This is desirable in the way which Samba's + database initialization code uses transactions, but could be a + surprise to any users expecting a successful transaction commit + to expose changes to others. + +The current solution is to specify the behavior at tdb_open(), +with the default currently that nested transactions are allowed. +This flag can also be changed at runtime. + +2.3.1 Proposed Solution + +Given the usage patterns, it seems that the “least-surprise” +behavior of disallowing nested transactions should become the +default. Additionally, it seems the outer transaction is the only +code which knows whether inner transactions should be allowed, so +a flag to indicate this could be added to tdb_transaction_start. +However, this behavior can be simulated with a wrapper which uses +tdb_add_flags() and tdb_remove_flags(), so the API should not be +expanded for this relatively-obscure case. + +2.3.2 Status + +Incomplete; nesting flag is still defined as per tdb1. + +2.4 Incorrect Hash Function is Not Detected + +tdb_open_ex() allows the calling code to specify a different hash +function to use, but does not check that all other processes +accessing this tdb are using the same hash function. The result +is that records are missing from tdb_fetch(). + +2.4.1 Proposed Solution + +The header should contain an example hash result (eg. the hash of +0xdeadbeef), and tdb_open_ex() should check that the given hash +function produces the same answer, or fail the tdb_open call. + +2.4.2 Status + +Complete. + +2.5 tdb_set_max_dead/TDB_VOLATILE Expose Implementation + +In response to scalability issues with the free list ([TDB-Freelist-Is] +) two API workarounds have been incorporated in TDB: +tdb_set_max_dead() and the TDB_VOLATILE flag to tdb_open. The +latter actually calls the former with an argument of “5”. + +This code allows deleted records to accumulate without putting +them in the free list. On delete we iterate through each chain +and free them in a batch if there are more than max_dead entries. +These are never otherwise recycled except as a side-effect of a +tdb_repack. + +2.5.1 Proposed Solution + +With the scalability problems of the freelist solved, this API +can be removed. The TDB_VOLATILE flag may still be useful as a +hint that store and delete of records will be at least as common +as fetch in order to allow some internal tuning, but initially +will become a no-op. + +2.5.2 Status + +Incomplete. TDB_VOLATILE still defined, but implementation should +fail on unknown flags to be future-proof. + +2.6 <TDB-Files-Cannot>TDB Files Cannot Be Opened Multiple Times + In The Same Process + +No process can open the same TDB twice; we check and disallow it. +This is an unfortunate side-effect of fcntl locks, which operate +on a per-file rather than per-file-descriptor basis, and do not +nest. Thus, closing any file descriptor on a file clears all the +locks obtained by this process, even if they were placed using a +different file descriptor! + +Note that even if this were solved, deadlock could occur if +operations were nested: this is a more manageable programming +error in most cases. + +2.6.1 Proposed Solution + +We could lobby POSIX to fix the perverse rules, or at least lobby +Linux to violate them so that the most common implementation does +not have this restriction. This would be a generally good idea +for other fcntl lock users. + +Samba uses a wrapper which hands out the same tdb_context to +multiple callers if this happens, and does simple reference +counting. We should do this inside the tdb library, which already +emulates lock nesting internally; it would need to recognize when +deadlock occurs within a single process. This would create a new +failure mode for tdb operations (while we currently handle +locking failures, they are impossible in normal use and a process +encountering them can do little but give up). + +I do not see benefit in an additional tdb_open flag to indicate +whether re-opening is allowed, as though there may be some +benefit to adding a call to detect when a tdb_context is shared, +to allow other to create such an API. + +2.6.2 Status + +Incomplete. + +2.7 TDB API Is Not POSIX Thread-safe + +The TDB API uses an error code which can be queried after an +operation to determine what went wrong. This programming model +does not work with threads, unless specific additional guarantees +are given by the implementation. In addition, even +otherwise-independent threads cannot open the same TDB (as in [TDB-Files-Cannot] +). + +2.7.1 Proposed Solution + +Reachitecting the API to include a tdb_errcode pointer would be a +great deal of churn; we are better to guarantee that the +tdb_errcode is per-thread so the current programming model can be +maintained. + +This requires dynamic per-thread allocations, which is awkward +with POSIX threads (pthread_key_create space is limited and we +cannot simply allocate a key for every TDB). + +Internal locking is required to make sure that fcntl locks do not +overlap between threads, and also that the global list of tdbs is +maintained. + +The aim is that building tdb with -DTDB_PTHREAD will result in a +pthread-safe version of the library, and otherwise no overhead +will exist. Alternatively, a hooking mechanism similar to that +proposed for [Proposed-Solution-locking-hook] could be used to +enable pthread locking at runtime. + +2.7.2 Status + +Incomplete. + +2.8 *_nonblock Functions And *_mark Functions Expose + Implementation + +CTDB[footnote: +Clustered TDB, see http://ctdb.samba.org +] wishes to operate on TDB in a non-blocking manner. This is +currently done as follows: + +1. Call the _nonblock variant of an API function (eg. + tdb_lockall_nonblock). If this fails: + +2. Fork a child process, and wait for it to call the normal + variant (eg. tdb_lockall). + +3. If the child succeeds, call the _mark variant to indicate we + already have the locks (eg. tdb_lockall_mark). + +4. Upon completion, tell the child to release the locks (eg. + tdb_unlockall). + +5. Indicate to tdb that it should consider the locks removed (eg. + tdb_unlockall_mark). + +There are several issues with this approach. Firstly, adding two +new variants of each function clutters the API for an obscure +use, and so not all functions have three variants. Secondly, it +assumes that all paths of the functions ask for the same locks, +otherwise the parent process will have to get a lock which the +child doesn't have under some circumstances. I don't believe this +is currently the case, but it constrains the implementation. + +2.8.1 <Proposed-Solution-locking-hook>Proposed Solution + +Implement a hook for locking methods, so that the caller can +control the calls to create and remove fcntl locks. In this +scenario, ctdbd would operate as follows: + +1. Call the normal API function, eg tdb_lockall(). + +2. When the lock callback comes in, check if the child has the + lock. Initially, this is always false. If so, return 0. + Otherwise, try to obtain it in non-blocking mode. If that + fails, return EWOULDBLOCK. + +3. Release locks in the unlock callback as normal. + +4. If tdb_lockall() fails, see if we recorded a lock failure; if + so, call the child to repeat the operation. + +5. The child records what locks it obtains, and returns that + information to the parent. + +6. When the child has succeeded, goto 1. + +This is flexible enough to handle any potential locking scenario, +even when lock requirements change. It can be optimized so that +the parent does not release locks, just tells the child which +locks it doesn't need to obtain. + +It also keeps the complexity out of the API, and in ctdbd where +it is needed. + +2.8.2 Status + +Incomplete. + +2.9 tdb_chainlock Functions Expose Implementation + +tdb_chainlock locks some number of records, including the record +indicated by the given key. This gave atomicity guarantees; +no-one can start a transaction, alter, read or delete that key +while the lock is held. + +It also makes the same guarantee for any other key in the chain, +which is an internal implementation detail and potentially a +cause for deadlock. + +2.9.1 Proposed Solution + +None. It would be nice to have an explicit single entry lock +which effected no other keys. Unfortunately, this won't work for +an entry which doesn't exist. Thus while chainlock may be +implemented more efficiently for the existing case, it will still +have overlap issues with the non-existing case. So it is best to +keep the current (lack of) guarantee about which records will be +effected to avoid constraining our implementation. + +2.10 Signal Handling is Not Race-Free + +The tdb_setalarm_sigptr() call allows the caller's signal handler +to indicate that the tdb locking code should return with a +failure, rather than trying again when a signal is received (and +errno == EAGAIN). This is usually used to implement timeouts. + +Unfortunately, this does not work in the case where the signal is +received before the tdb code enters the fcntl() call to place the +lock: the code will sleep within the fcntl() code, unaware that +the signal wants it to exit. In the case of long timeouts, this +does not happen in practice. + +2.10.1 Proposed Solution + +The locking hooks proposed in[Proposed-Solution-locking-hook] +would allow the user to decide on whether to fail the lock +acquisition on a signal. This allows the caller to choose their +own compromise: they could narrow the race by checking +immediately before the fcntl call.[footnote: +It may be possible to make this race-free in some implementations +by having the signal handler alter the struct flock to make it +invalid. This will cause the fcntl() lock call to fail with +EINVAL if the signal occurs before the kernel is entered, +otherwise EAGAIN. +] + +2.10.2 Status + +Incomplete. + +2.11 The API Uses Gratuitous Typedefs, Capitals + +typedefs are useful for providing source compatibility when types +can differ across implementations, or arguably in the case of +function pointer definitions which are hard for humans to parse. +Otherwise it is simply obfuscation and pollutes the namespace. + +Capitalization is usually reserved for compile-time constants and +macros. + + TDB_CONTEXT There is no reason to use this over 'struct + tdb_context'; the definition isn't visible to the API user + anyway. + + TDB_DATA There is no reason to use this over struct TDB_DATA; + the struct needs to be understood by the API user. + + struct TDB_DATA This would normally be called 'struct + tdb_data'. + + enum TDB_ERROR Similarly, this would normally be enum + tdb_error. + +2.11.1 Proposed Solution + +None. Introducing lower case variants would please pedants like +myself, but if it were done the existing ones should be kept. +There is little point forcing a purely cosmetic change upon tdb +users. + +2.12 <tdb_log_func-Doesnt-Take>tdb_log_func Doesn't Take The + Private Pointer + +For API compatibility reasons, the logging function needs to call +tdb_get_logging_private() to retrieve the pointer registered by +the tdb_open_ex for logging. + +2.12.1 Proposed Solution + +It should simply take an extra argument, since we are prepared to +break the API/ABI. + +2.12.2 Status + +Complete. + +2.13 Various Callback Functions Are Not Typesafe + +The callback functions in tdb_set_logging_function (after [tdb_log_func-Doesnt-Take] + is resolved), tdb_parse_record, tdb_traverse, tdb_traverse_read +and tdb_check all take void * and must internally convert it to +the argument type they were expecting. + +If this type changes, the compiler will not produce warnings on +the callers, since it only sees void *. + +2.13.1 Proposed Solution + +With careful use of macros, we can create callback functions +which give a warning when used on gcc and the types of the +callback and its private argument differ. Unsupported compilers +will not give a warning, which is no worse than now. In addition, +the callbacks become clearer, as they need not use void * for +their parameter. + +See CCAN's typesafe_cb module at +http://ccan.ozlabs.org/info/typesafe_cb.html + +2.13.2 Status + +Incomplete. + +2.14 TDB_CLEAR_IF_FIRST Must Be Specified On All Opens, + tdb_reopen_all Problematic + +The TDB_CLEAR_IF_FIRST flag to tdb_open indicates that the TDB +file should be cleared if the caller discovers it is the only +process with the TDB open. However, if any caller does not +specify TDB_CLEAR_IF_FIRST it will not be detected, so will have +the TDB erased underneath them (usually resulting in a crash). + +There is a similar issue on fork(); if the parent exits (or +otherwise closes the tdb) before the child calls tdb_reopen_all() +to establish the lock used to indicate the TDB is opened by +someone, a TDB_CLEAR_IF_FIRST opener at that moment will believe +it alone has opened the TDB and will erase it. + +2.14.1 Proposed Solution + +Remove TDB_CLEAR_IF_FIRST. Other workarounds are possible, but +see [TDB_CLEAR_IF_FIRST-Imposes-Performance]. + +2.14.2 Status + +Incomplete, TDB_CLEAR_IF_FIRST still defined, but not +implemented. + +2.15 Extending The Header Is Difficult + +We have reserved (zeroed) words in the TDB header, which can be +used for future features. If the future features are compulsory, +the version number must be updated to prevent old code from +accessing the database. But if the future feature is optional, we +have no way of telling if older code is accessing the database or +not. + +2.15.1 Proposed Solution + +The header should contain a “format variant” value (64-bit). This +is divided into two 32-bit parts: + +1. The lower part reflects the format variant understood by code + accessing the database. + +2. The upper part reflects the format variant you must understand + to write to the database (otherwise you can only open for + reading). + +The latter field can only be written at creation time, the former +should be written under the OPEN_LOCK when opening the database +for writing, if the variant of the code is lower than the current +lowest variant. + +This should allow backwards-compatible features to be added, and +detection if older code (which doesn't understand the feature) +writes to the database. + +2.15.2 Status + +Incomplete. + +2.16 Record Headers Are Not Expandible + +If we later want to add (say) checksums on keys and data, it +would require another format change, which we'd like to avoid. + +2.16.1 Proposed Solution + +We often have extra padding at the tail of a record. If we ensure +that the first byte (if any) of this padding is zero, we will +have a way for future changes to detect code which doesn't +understand a new format: the new code would write (say) a 1 at +the tail, and thus if there is no tail or the first byte is 0, we +would know the extension is not present on that record. + +2.16.2 Status + +Incomplete. + +2.17 TDB Does Not Use Talloc + +Many users of TDB (particularly Samba) use the talloc allocator, +and thus have to wrap TDB in a talloc context to use it +conveniently. + +2.17.1 Proposed Solution + +The allocation within TDB is not complicated enough to justify +the use of talloc, and I am reluctant to force another +(excellent) library on TDB users. Nonetheless a compromise is +possible. An attribute (see [attributes]) can be added later to +tdb_open() to provide an alternate allocation mechanism, +specifically for talloc but usable by any other allocator (which +would ignore the “context” argument). + +This would form a talloc heirarchy as expected, but the caller +would still have to attach a destructor to the tdb context +returned from tdb_open to close it. All TDB_DATA fields would be +children of the tdb_context, and the caller would still have to +manage them (using talloc_free() or talloc_steal()). + +2.17.2 Status + +Deferred. + +3 Performance And Scalability Issues + +3.1 <TDB_CLEAR_IF_FIRST-Imposes-Performance>TDB_CLEAR_IF_FIRST + Imposes Performance Penalty + +When TDB_CLEAR_IF_FIRST is specified, a 1-byte read lock is +placed at offset 4 (aka. the ACTIVE_LOCK). While these locks +never conflict in normal tdb usage, they do add substantial +overhead for most fcntl lock implementations when the kernel +scans to detect if a lock conflict exists. This is often a single +linked list, making the time to acquire and release a fcntl lock +O(N) where N is the number of processes with the TDB open, not +the number actually doing work. + +In a Samba server it is common to have huge numbers of clients +sitting idle, and thus they have weaned themselves off the +TDB_CLEAR_IF_FIRST flag.[footnote: +There is a flag to tdb_reopen_all() which is used for this +optimization: if the parent process will outlive the child, the +child does not need the ACTIVE_LOCK. This is a workaround for +this very performance issue. +] + +3.1.1 Proposed Solution + +Remove the flag. It was a neat idea, but even trivial servers +tend to know when they are initializing for the first time and +can simply unlink the old tdb at that point. + +3.1.2 Status + +Incomplete; TDB_CLEAR_IF_FIRST still defined, but does nothing. + +3.2 TDB Files Have a 4G Limit + +This seems to be becoming an issue (so much for “trivial”!), +particularly for ldb. + +3.2.1 Proposed Solution + +A new, incompatible TDB format which uses 64 bit offsets +internally rather than 32 bit as now. For simplicity of endian +conversion (which TDB does on the fly if required), all values +will be 64 bit on disk. In practice, some upper bits may be used +for other purposes, but at least 56 bits will be available for +file offsets. + +tdb_open() will automatically detect the old version, and even +create them if TDB_VERSION6 is specified to tdb_open. + +32 bit processes will still be able to access TDBs larger than 4G +(assuming that their off_t allows them to seek to 64 bits), they +will gracefully fall back as they fail to mmap. This can happen +already with large TDBs. + +Old versions of tdb will fail to open the new TDB files (since 28 +August 2009, commit 398d0c29290: prior to that any unrecognized +file format would be erased and initialized as a fresh tdb!) + +3.2.2 Status + +Complete. + +3.3 TDB Records Have a 4G Limit + +This has not been a reported problem, and the API uses size_t +which can be 64 bit on 64 bit platforms. However, other limits +may have made such an issue moot. + +3.3.1 Proposed Solution + +Record sizes will be 64 bit, with an error returned on 32 bit +platforms which try to access such records (the current +implementation would return TDB_ERR_OOM in a similar case). It +seems unlikely that 32 bit keys will be a limitation, so the +implementation may not support this (see [sub:Records-Incur-A]). + +3.3.2 Status + +Complete. + +3.4 Hash Size Is Determined At TDB Creation Time + +TDB contains a number of hash chains in the header; the number is +specified at creation time, and defaults to 131. This is such a +bottleneck on large databases (as each hash chain gets quite +long), that LDB uses 10,000 for this hash. In general it is +impossible to know what the 'right' answer is at database +creation time. + +3.4.1 <sub:Hash-Size-Solution>Proposed Solution + +After comprehensive performance testing on various scalable hash +variants[footnote: +http://rusty.ozlabs.org/?p=89 and http://rusty.ozlabs.org/?p=94 +This was annoying because I was previously convinced that an +expanding tree of hashes would be very close to optimal. +], it became clear that it is hard to beat a straight linear hash +table which doubles in size when it reaches saturation. +Unfortunately, altering the hash table introduces serious locking +complications: the entire hash table needs to be locked to +enlarge the hash table, and others might be holding locks. +Particularly insidious are insertions done under tdb_chainlock. + +Thus an expanding layered hash will be used: an array of hash +groups, with each hash group exploding into pointers to lower +hash groups once it fills, turning into a hash tree. This has +implications for locking: we must lock the entire group in case +we need to expand it, yet we don't know how deep the tree is at +that point. + +Note that bits from the hash table entries should be stolen to +hold more hash bits to reduce the penalty of collisions. We can +use the otherwise-unused lower 3 bits. If we limit the size of +the database to 64 exabytes, we can use the top 8 bits of the +hash entry as well. These 11 bits would reduce false positives +down to 1 in 2000 which is more than we need: we can use one of +the bits to indicate that the extra hash bits are valid. This +means we can choose not to re-hash all entries when we expand a +hash group; simply use the next bits we need and mark them +invalid. + +3.4.2 Status + +Complete. + +3.5 <TDB-Freelist-Is>TDB Freelist Is Highly Contended + +TDB uses a single linked list for the free list. Allocation +occurs as follows, using heuristics which have evolved over time: + +1. Get the free list lock for this whole operation. + +2. Multiply length by 1.25, so we always over-allocate by 25%. + +3. Set the slack multiplier to 1. + +4. Examine the current freelist entry: if it is > length but < + the current best case, remember it as the best case. + +5. Multiply the slack multiplier by 1.05. + +6. If our best fit so far is less than length * slack multiplier, + return it. The slack will be turned into a new free record if + it's large enough. + +7. Otherwise, go onto the next freelist entry. + +Deleting a record occurs as follows: + +1. Lock the hash chain for this whole operation. + +2. Walk the chain to find the record, keeping the prev pointer + offset. + +3. If max_dead is non-zero: + + (a) Walk the hash chain again and count the dead records. + + (b) If it's more than max_dead, bulk free all the dead ones + (similar to steps 4 and below, but the lock is only obtained + once). + + (c) Simply mark this record as dead and return. + +4. Get the free list lock for the remainder of this operation. + +5. <right-merging>Examine the following block to see if it is + free; if so, enlarge the current block and remove that block + from the free list. This was disabled, as removal from the free + list was O(entries-in-free-list). + +6. Examine the preceeding block to see if it is free: for this + reason, each block has a 32-bit tailer which indicates its + length. If it is free, expand it to cover our new block and + return. + +7. Otherwise, prepend ourselves to the free list. + +Disabling right-merging (step [right-merging]) causes +fragmentation; the other heuristics proved insufficient to +address this, so the final answer to this was that when we expand +the TDB file inside a transaction commit, we repack the entire +tdb. + +The single list lock limits our allocation rate; due to the other +issues this is not currently seen as a bottleneck. + +3.5.1 Proposed Solution + +The first step is to remove all the current heuristics, as they +obviously interact, then examine them once the lock contention is +addressed. + +The free list must be split to reduce contention. Assuming +perfect free merging, we can at most have 1 free list entry for +each entry. This implies that the number of free lists is related +to the size of the hash table, but as it is rare to walk a large +number of free list entries we can use far fewer, say 1/32 of the +number of hash buckets. + +It seems tempting to try to reuse the hash implementation which +we use for records here, but we have two ways of searching for +free entries: for allocation we search by size (and possibly +zone) which produces too many clashes for our hash table to +handle well, and for coalescing we search by address. Thus an +array of doubly-linked free lists seems preferable. + +There are various benefits in using per-size free lists (see [sub:TDB-Becomes-Fragmented] +) but it's not clear this would reduce contention in the common +case where all processes are allocating/freeing the same size. +Thus we almost certainly need to divide in other ways: the most +obvious is to divide the file into zones, and using a free list +(or table of free lists) for each. This approximates address +ordering. + +Unfortunately it is difficult to know what heuristics should be +used to determine zone sizes, and our transaction code relies on +being able to create a “recovery area” by simply appending to the +file (difficult if it would need to create a new zone header). +Thus we use a linked-list of free tables; currently we only ever +create one, but if there is more than one we choose one at random +to use. In future we may use heuristics to add new free tables on +contention. We only expand the file when all free tables are +exhausted. + +The basic algorithm is as follows. Freeing is simple: + +1. Identify the correct free list. + +2. Lock the corresponding list. + +3. Re-check the list (we didn't have a lock, sizes could have + changed): relock if necessary. + +4. Place the freed entry in the list. + +Allocation is a little more complicated, as we perform delayed +coalescing at this point: + +1. Pick a free table; usually the previous one. + +2. Lock the corresponding list. + +3. If the top entry is -large enough, remove it from the list and + return it. + +4. Otherwise, coalesce entries in the list.If there was no entry + large enough, unlock the list and try the next largest list + +5. If no list has an entry which meets our needs, try the next + free table. + +6. If no zone satisfies, expand the file. + +This optimizes rapid insert/delete of free list entries by not +coalescing them all the time.. First-fit address ordering +ordering seems to be fairly good for keeping fragmentation low +(see [sub:TDB-Becomes-Fragmented]). Note that address ordering +does not need a tailer to coalesce, though if we needed one we +could have one cheaply: see [sub:Records-Incur-A]. + +Each free entry has the free table number in the header: less +than 255. It also contains a doubly-linked list for easy +deletion. + +3.6 <sub:TDB-Becomes-Fragmented>TDB Becomes Fragmented + +Much of this is a result of allocation strategy[footnote: +The Memory Fragmentation Problem: Solved? Johnstone & Wilson 1995 +ftp://ftp.cs.utexas.edu/pub/garbage/malloc/ismm98.ps +] and deliberate hobbling of coalescing; internal fragmentation +(aka overallocation) is deliberately set at 25%, and external +fragmentation is only cured by the decision to repack the entire +db when a transaction commit needs to enlarge the file. + +3.6.1 Proposed Solution + +The 25% overhead on allocation works in practice for ldb because +indexes tend to expand by one record at a time. This internal +fragmentation can be resolved by having an “expanded” bit in the +header to note entries that have previously expanded, and +allocating more space for them. + +There are is a spectrum of possible solutions for external +fragmentation: one is to use a fragmentation-avoiding allocation +strategy such as best-fit address-order allocator. The other end +of the spectrum would be to use a bump allocator (very fast and +simple) and simply repack the file when we reach the end. + +There are three problems with efficient fragmentation-avoiding +allocators: they are non-trivial, they tend to use a single free +list for each size, and there's no evidence that tdb allocation +patterns will match those recorded for general allocators (though +it seems likely). + +Thus we don't spend too much effort on external fragmentation; we +will be no worse than the current code if we need to repack on +occasion. More effort is spent on reducing freelist contention, +and reducing overhead. + +3.7 <sub:Records-Incur-A>Records Incur A 28-Byte Overhead + +Each TDB record has a header as follows: + +struct tdb_record { + + tdb_off_t next; /* offset of the next record in the list +*/ + + tdb_len_t rec_len; /* total byte length of record */ + + tdb_len_t key_len; /* byte length of key */ + + tdb_len_t data_len; /* byte length of data */ + + uint32_t full_hash; /* the full 32 bit hash of the key */ + + uint32_t magic; /* try to catch errors */ + + /* the following union is implied: + + union { + + char record[rec_len]; + + struct { + + char key[key_len]; + + char data[data_len]; + + } + + uint32_t totalsize; (tailer) + + } + + */ + +}; + +Naively, this would double to a 56-byte overhead on a 64 bit +implementation. + +3.7.1 Proposed Solution + +We can use various techniques to reduce this for an allocated +block: + +1. The 'next' pointer is not required, as we are using a flat + hash table. + +2. 'rec_len' can instead be expressed as an addition to key_len + and data_len (it accounts for wasted or overallocated length in + the record). Since the record length is always a multiple of 8, + we can conveniently fit it in 32 bits (representing up to 35 + bits). + +3. 'key_len' and 'data_len' can be reduced. I'm unwilling to + restrict 'data_len' to 32 bits, but instead we can combine the + two into one 64-bit field and using a 5 bit value which + indicates at what bit to divide the two. Keys are unlikely to + scale as fast as data, so I'm assuming a maximum key size of 32 + bits. + +4. 'full_hash' is used to avoid a memcmp on the “miss” case, but + this is diminishing returns after a handful of bits (at 10 + bits, it reduces 99.9% of false memcmp). As an aside, as the + lower bits are already incorporated in the hash table + resolution, the upper bits should be used here. Note that it's + not clear that these bits will be a win, given the extra bits + in the hash table itself (see [sub:Hash-Size-Solution]). + +5. 'magic' does not need to be enlarged: it currently reflects + one of 5 values (used, free, dead, recovery, and + unused_recovery). It is useful for quick sanity checking + however, and should not be eliminated. + +6. 'tailer' is only used to coalesce free blocks (so a block to + the right can find the header to check if this block is free). + This can be replaced by a single 'free' bit in the header of + the following block (and the tailer only exists in free + blocks).[footnote: +This technique from Thomas Standish. Data Structure Techniques. +Addison-Wesley, Reading, Massachusetts, 1980. +] The current proposed coalescing algorithm doesn't need this, + however. + +This produces a 16 byte used header like this: + +struct tdb_used_record { + + uint32_t used_magic : 16, + + + + key_data_divide: 5, + + top_hash: 11; + + uint32_t extra_octets; + + uint64_t key_and_data_len; + +}; + +And a free record like this: + +struct tdb_free_record { + + uint64_t free_magic: 8, + + prev : 56; + + + + uint64_t free_table: 8, + + total_length : 56 + + uint64_t next;; + +}; + +Note that by limiting valid offsets to 56 bits, we can pack +everything we need into 3 64-byte words, meaning our minimum +record size is 8 bytes. + +3.7.2 Status + +Complete. + +3.8 Transaction Commit Requires 4 fdatasync + +The current transaction algorithm is: + +1. write_recovery_data(); + +2. sync(); + +3. write_recovery_header(); + +4. sync(); + +5. overwrite_with_new_data(); + +6. sync(); + +7. remove_recovery_header(); + +8. sync(); + +On current ext3, each sync flushes all data to disk, so the next +3 syncs are relatively expensive. But this could become a +performance bottleneck on other filesystems such as ext4. + +3.8.1 Proposed Solution + +Neil Brown points out that this is overzealous, and only one sync +is needed: + +1. Bundle the recovery data, a transaction counter and a strong + checksum of the new data. + +2. Strong checksum that whole bundle. + +3. Store the bundle in the database. + +4. Overwrite the oldest of the two recovery pointers in the + header (identified using the transaction counter) with the + offset of this bundle. + +5. sync. + +6. Write the new data to the file. + +Checking for recovery means identifying the latest bundle with a +valid checksum and using the new data checksum to ensure that it +has been applied. This is more expensive than the current check, +but need only be done at open. For running databases, a separate +header field can be used to indicate a transaction in progress; +we need only check for recovery if this is set. + +3.8.2 Status + +Deferred. + +3.9 <sub:TDB-Does-Not>TDB Does Not Have Snapshot Support + +3.9.1 Proposed SolutionNone. At some point you say “use a real + database” (but see [replay-attribute]). + +But as a thought experiment, if we implemented transactions to +only overwrite free entries (this is tricky: there must not be a +header in each entry which indicates whether it is free, but use +of presence in metadata elsewhere), and a pointer to the hash +table, we could create an entirely new commit without destroying +existing data. Then it would be easy to implement snapshots in a +similar way. + +This would not allow arbitrary changes to the database, such as +tdb_repack does, and would require more space (since we have to +preserve the current and future entries at once). If we used hash +trees rather than one big hash table, we might only have to +rewrite some sections of the hash, too. + +We could then implement snapshots using a similar method, using +multiple different hash tables/free tables. + +3.9.2 Status + +Deferred. + +3.10 Transactions Cannot Operate in Parallel + +This would be useless for ldb, as it hits the index records with +just about every update. It would add significant complexity in +resolving clashes, and cause the all transaction callers to write +their code to loop in the case where the transactions spuriously +failed. + +3.10.1 Proposed Solution + +None (but see [replay-attribute]). We could solve a small part of +the problem by providing read-only transactions. These would +allow one write transaction to begin, but it could not commit +until all r/o transactions are done. This would require a new +RO_TRANSACTION_LOCK, which would be upgraded on commit. + +3.10.2 Status + +Deferred. + +3.11 Default Hash Function Is Suboptimal + +The Knuth-inspired multiplicative hash used by tdb is fairly slow +(especially if we expand it to 64 bits), and works best when the +hash bucket size is a prime number (which also means a slow +modulus). In addition, it is highly predictable which could +potentially lead to a Denial of Service attack in some TDB uses. + +3.11.1 Proposed Solution + +The Jenkins lookup3 hash[footnote: +http://burtleburtle.net/bob/c/lookup3.c +] is a fast and superbly-mixing hash. It's used by the Linux +kernel and almost everything else. This has the particular +properties that it takes an initial seed, and produces two 32 bit +hash numbers, which we can combine into a 64-bit hash. + +The seed should be created at tdb-creation time from some random +source, and placed in the header. This is far from foolproof, but +adds a little bit of protection against hash bombing. + +3.11.2 Status + +Complete. + +3.12 <Reliable-Traversal-Adds>Reliable Traversal Adds Complexity + +We lock a record during traversal iteration, and try to grab that +lock in the delete code. If that grab on delete fails, we simply +mark it deleted and continue onwards; traversal checks for this +condition and does the delete when it moves off the record. + +If traversal terminates, the dead record may be left +indefinitely. + +3.12.1 Proposed Solution + +Remove reliability guarantees; see [traverse-Proposed-Solution]. + +3.12.2 Status + +Complete. + +3.13 Fcntl Locking Adds Overhead + +Placing a fcntl lock means a system call, as does removing one. +This is actually one reason why transactions can be faster +(everything is locked once at transaction start). In the +uncontended case, this overhead can theoretically be eliminated. + +3.13.1 Proposed Solution + +None. + +We tried this before with spinlock support, in the early days of +TDB, and it didn't make much difference except in manufactured +benchmarks. + +We could use spinlocks (with futex kernel support under Linux), +but it means that we lose automatic cleanup when a process dies +with a lock. There is a method of auto-cleanup under Linux, but +it's not supported by other operating systems. We could +reintroduce a clear-if-first-style lock and sweep for dead +futexes on open, but that wouldn't help the normal case of one +concurrent opener dying. Increasingly elaborate repair schemes +could be considered, but they require an ABI change (everyone +must use them) anyway, so there's no need to do this at the same +time as everything else. + +3.14 Some Transactions Don't Require Durability + +Volker points out that gencache uses a CLEAR_IF_FIRST tdb for +normal (fast) usage, and occasionally empties the results into a +transactional TDB. This kind of usage prioritizes performance +over durability: as long as we are consistent, data can be lost. + +This would be more neatly implemented inside tdb: a “soft” +transaction commit (ie. syncless) which meant that data may be +reverted on a crash. + +3.14.1 Proposed Solution + +None. + +Unfortunately any transaction scheme which overwrites old data +requires a sync before that overwrite to avoid the possibility of +corruption. + +It seems possible to use a scheme similar to that described in [sub:TDB-Does-Not] +,where transactions are committed without overwriting existing +data, and an array of top-level pointers were available in the +header. If the transaction is “soft” then we would not need a +sync at all: existing processes would pick up the new hash table +and free list and work with that. + +At some later point, a sync would allow recovery of the old data +into the free lists (perhaps when the array of top-level pointers +filled). On crash, tdb_open() would examine the array of top +levels, and apply the transactions until it encountered an +invalid checksum. + +3.15 Tracing Is Fragile, Replay Is External + +The current TDB has compile-time-enabled tracing code, but it +often breaks as it is not enabled by default. In a similar way, +the ctdb code has an external wrapper which does replay tracing +so it can coordinate cluster-wide transactions. + +3.15.1 Proposed Solution<replay-attribute> + +Tridge points out that an attribute can be later added to +tdb_open (see [attributes]) to provide replay/trace hooks, which +could become the basis for this and future parallel transactions +and snapshot support. + +3.15.2 Status + +Deferred. |