Age | Commit message (Collapse) | Author | Files | Lines |
|
Guenther
|
|
|
|
|
|
tdb transactions were designed to be robust against the machine
powering off, but interestingly were never designed to handle the case
where an administrator kill -9's a process during commit. Because
recovery is only done on tdb_open, processes with the tdb already
mapped will simply use it despite it being corrupt and needing
recovery.
The solution to this is to check for recovery every time we grab a
data lock: we could have gained the lock because a process just died.
This has no measurable cost: here is the time for tdbtorture -s 0 -n 1
-l 10000:
Before:
2.75 2.50 2.81 3.19 2.91 2.53 2.72 2.50 2.78 2.77 = Avg 2.75
After:
2.81 2.57 3.42 2.49 3.02 2.49 2.84 2.48 2.80 2.43 = Avg 2.74
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
The current recovery code truncates the tdb file on recovery. This is
fine if recovery is only done on first open, but is a really bad idea
as we move to allowing recovery on "live" databases.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Now the transaction code uses the standard allrecord lock, that stops
us from trying to grab any per-record locks anyway. We don't need to
have special noop lock ops for transactions.
This is a nice simplification: if you see brlock, you know it's really
going to grab a lock.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
tdb_release_extra_locks() is too general: it carefully skips over the
transaction lock, even though the only caller then drops it. Change
this, and rename it to show it's clearly transaction-specific.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Now the transaction allrecord lock is the standard one, and thus is cleaned
in tdb_release_extra_locks(), _tdb_transaction_cancel() doesn't need to
know what type it is.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Centralize locking of all chains of the tdb; rename _tdb_lockall to
tdb_allrecord_lock and _tdb_unlockall to tdb_allrecord_unlock, and
tdb_brlock_upgrade to tdb_allrecord_upgrade.
Then we use this in the transaction code. Unfortunately, if the transaction
code records that it has grabbed the allrecord lock read-only, write locks
will fail, so we treat this upgradable lock as a write lock, and mark it
as upgradable using the otherwise-unused offset field.
One subtlety: now the transaction code is using the allrecord_lock, the
tdb_release_extra_locks() function drops it for us, so we no longer need
to do it manually in _tdb_transaction_cancel.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This never nests, so it's overkill, but it centralizes the locking into
lock.c and removes the ugly flag in the transaction code to track whether
we have the lock or not.
Note that we have a temporary hack so this places a real lock, despite
the fact that we are in a transaction.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Move locking intelligence back into lock.c, rather than open-coding the
lock release in transaction.c.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
In many places we check whether locks are held: add a helper to do this.
The _tdb_lockall() case has already checked for the allrecord lock, so
the extra work done by tdb_have_extra_locks() is merely redundant.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Because fcntl locks don't nest, we track them in the tdb->lockrecs array
and only place/release them when the count goes to 1/0. We only do this
for record locks, so we simply place the list number (or -1 for the free
list) in the structure.
To generalize this:
1) Put the offset rather than list number in struct tdb_lock_type.
2) Rename _tdb_lock() to tdb_nest_lock, make it non-static and move the
allrecord check out to the callers (except the mark case which doesn't
care).
3) Rename _tdb_unlock() to tdb_nest_unlock(), make it non-static and
move the allrecord out to the callers (except mark again).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
The word global is overloaded in tdb. The global_lock inside struct
tdb_context is used to indicate we hold a lock across all the chains.
Rename it to allrecord_lock.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
The word global is overloaded in tdb. The GLOBAL_LOCK offset is used at
open time to serialize initialization (and by the transaction code to block
open).
Rename it to OPEN_LOCK.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Now tdb_open() calls tdb_transaction_cancel() instead of
_tdb_transaction_cancel, we can make it static.
Signed-off-by: Rusty Russell<rusty@rustcorp.com.au>
|
|
This is taken from the CCAN code base: rather than using tdb_brlock for
locking and unlocking, we split it into brlock and brunlock functions.
For extra debugging information, brunlock says what kind of lock it is
unlocking (even though fnctl locks don't need this). This requires an
extra argument to tdb_transaction_unlock() so we know whether the
lock was upgraded to a write lock or not.
We also use a "flags" argument tdb_brlock:
1) TDB_LOCK_NOWAIT replaces lck_type = F_SETLK (vs F_SETLKW).
2) TDB_LOCK_MARK_ONLY replaces setting TDB_MARK_LOCK bit in ltype.
3) TDB_LOCK_PROBE replaces the "probe" argument.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This might help on some filesystems
|
|
|
|
If a process (or the machine) dies after just after writing the
recovery head (pointing at the end of file), the recovery record will filled
with 0x42. This will not invoke a recovery on open, since rec.magic
!= TDB_RECOVERY_MAGIC.
Unfortunately, the first transaction commit will happily reuse that
area: tdb_recovery_allocate() doesn't check the magic. The recovery
record has length 0x42424242, and it writes that back into the
now-valid-looking transaction header) for the next comer (which
happens to be tdb_wipe_all in my tests).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
There was a bug in tdb where the
tdb_brlock(tdb, GLOBAL_LOCK, F_UNLCK, F_SETLKW, 0, 1);
(ending the transaction-"mutex") was done before the
/* remove the recovery marker */
This means that when a transaction is committed there is a window where another
opener of the file sees the transaction marker while the transaction committer
is still fully functional and working on it. This led to transaction being
rolled back by that second opener of the file while transaction_commit() gave
no error to the caller.
This patch moves the F_UNLCK to after the recovery marker was removed, closing
this window.
|
|
We need to keep TDB_ALLOW_NESTING as default behavior,
so that existing code continues to work.
However we may change the default together with a major version
number change in future.
metze
|
|
Make the default be that transaction is not allowed and any attempt to create a nested transaction will fail with TDB_ERR_NESTING.
If an application can cope with transaction nesting and the implicit
semantics of tdb_transaction_commit(), it can enable transaction nesting
by using the TDB_ALLOW_NESTING flag.
(cherry picked from ctdb commit 3e49e41c21eb8c53084aa8cc7fd3557bdd8eb7b6)
Signed-off-by: Stefan Metzmacher <metze@samba.org>
|
|
metze
|
|
It was a regrettable hack which I used to reduce line count in tdb; in fact it caused confusion as can be seen in this patch.
In particular, ecode now needs to be set before TDB_LOG anyway, and having it exposed in
the header is useless (the struct tdb_context isn't defined, so it's doubly useless).
Also, we should never set errno, as io.c was doing.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
When TDB_TRACE is defined (in tdb_private.h), verbose tracing of tdb operations is enabled.
This can be replayed using "replay_trace" from http://ccan.ozlabs.org/info/tdb.
The majority of this patch comes from moving internal functions to _<funcname> to
avoid double-tracing. There should be no additional overhead for the normal (!TDB_TRACE)
case.
Note that the verbose traces compress really well with rzip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
We previously only allowed a commit to happen after a prepare
commit. It is in fact safe to allow reads between a prepare and a
commit, and the s4 replication code can make use of that, so allow it.
|
|
The idea behind this is to recover from badly fragmented free
lists. Choosing the point where the file expands is fairly arbitrary,
but seems to work well.
|
|
During a transaction commit tdb normally uses fsync/msync calls to
make it crash safe. This can be disabled using the TDB_NOSYNC flag,
but it wasn't disabling all the code paths that caused a fsync/msync.
|
|
|
|
Using tdb_transaction_prepare_commit() gives us 2-phase commits. This
allows us to safely commit across multiple tdb databases at once, with
reasonable transaction semantics
Signed-off-by: tridge@samba.org
|
|
|