|
ntdb's transaction code has an optimization which tdb's doesn't: it
only writes the parts of blocks whose contents have changed. This
means we can actually have a transaction which turns out to need no
recovery region.
This breaks the recovery setup logic, which sets the current recovery
size to 0 if there's no recovery area, and assumes that we'll always
create a new recovery area since the recovery will always need > 0
bytes.
In fact, if we really haven't changed anything, we can skip the
transaction commit altogether: since this happens at least once with
Samba, it's worth doing.
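
As a rough illustration of the idea (not the ntdb code itself), the
sketch below compares old and new block contents, records only the byte
runs that differ, and treats an empty write list as "nothing to commit".
The names changed_ranges and plan_commit are hypothetical.

```python
def changed_ranges(old: bytes, new: bytes):
    """Return (offset, length) runs where `new` differs from `old`."""
    assert len(old) == len(new)
    ranges = []
    start = None
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            if start is None:
                start = i  # a run of changed bytes begins here
        elif start is not None:
            ranges.append((start, i - start))
            start = None
    if start is not None:
        ranges.append((start, len(old) - start))
    return ranges


def plan_commit(old_blocks, new_blocks):
    """Return (block_index, offset, data) writes; an empty list means
    the transaction changed nothing and the commit can be skipped."""
    writes = []
    for idx, (old, new) in enumerate(zip(old_blocks, new_blocks)):
        for off, length in changed_ranges(old, new):
            writes.append((idx, off, new[off:off + length]))
    return writes
```

With no differing bytes there is nothing to save for recovery either,
which is the zero-size case that tripped up the recovery setup logic.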
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Andrew Bartlett <abartlet@samba.org>
|
|
Reported-by: Matthieu Patou <mat@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Autobuild-User(master): Rusty Russell <rusty@rustcorp.com.au>
Autobuild-Date(master): Mon Oct 8 04:43:37 CEST 2012 on sn-devel-104
|
|
Signed-Off-By: Jelmer Vernooij <jelmer@samba.org>
Autobuild-User(master): Jelmer Vernooij <jelmer@samba.org>
Autobuild-Date(master): Tue Sep 25 22:40:39 CEST 2012 on sn-devel-104
|
|
|
|
|
|
Don't expose a libccan.so; it would produce clashes if someone else
does the same thing. Unfortunately, if we just change it from a
SAMBA_LIBRARY to a SAMBA_SUBSYSTEM, it doesn't create a static library
as we'd like, but links all the object files in. This means we get
many duplicates (eg. everyone gets a copy of tally, even though only
ntdb wants it).
So, the solution is threefold:
1) Make the ccan modules separate.
2) Make the ccan modules SAMBA_SUBSYSTEMs not SAMBA_LIBRARYs so we don't
build shared libraries which we can't share.
3) Make the places which use ccan explicit.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Autobuild-User(master): Rusty Russell <rusty@rustcorp.com.au>
Autobuild-Date(master): Fri Jun 29 06:22:44 CEST 2012 on sn-devel-104
|
|
This means we no longer have to unmap if we want to compare a record.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
In particular, this tests that we can store enough records to make the
database expand while we map the given record. We use a global lock for
this, but it could happen in theory with another process.
It also tests that we can recurse inside ntdb_parse_record().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Since we have a readlock, any write will grab a write lock: if it happens
to be on the same bucket, we'll fail.
For that reason, enforce read-only so every write operation fails
(even for NTDB_NOLOCK or NTDB_INTERNAL dbs), and document it!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
NTDB_INTERNAL databases need to malloc and copy to keep old versions
around if we expand, in a similar way to the manner in which we keep old
mmaps around.
Of course, it only works for read-only accesses, since the two copies
are not synced.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This means keeping the old mmap around when we expand the database.
We could revert to read/write, except for platforms with incoherent
mmap (ie. OpenBSD), where we need to use mmap for all accesses.
Thus we keep a linked list of old maps, and unmap them when the last access
finally goes away.
This is required if we want ntdb_parse_record() callbacks to be able
to expand the database.
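
A minimal sketch of the bookkeeping described above, using a toy
in-memory model rather than real mmap calls. Mapping, ToyFile and their
methods are hypothetical names; the copy in expand() only stands in for
remapping a larger file (with a real mmap, old and new maps would share
the same file pages).

```python
class Mapping:
    """One mapped view of the file, with a count of live accesses."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.refs = 0
        self.unmapped = False


class ToyFile:
    def __init__(self, size):
        self.current = Mapping(size)
        self.old_maps = []  # superseded maps that still have users

    def access(self):
        """Start an access against the current map."""
        m = self.current
        m.refs += 1
        return m

    def release(self, m):
        """End an access; drop an old map once its last user is gone."""
        m.refs -= 1
        if m.refs == 0 and m is not self.current:
            m.unmapped = True
            self.old_maps.remove(m)

    def expand(self, new_size):
        """Switch to a larger map, keeping the old one if still in use."""
        old = self.current
        new = Mapping(new_size)
        new.buf[:len(old.buf)] = old.buf
        self.current = new
        if old.refs > 0:
            self.old_maps.append(old)
        else:
            old.unmapped = True
```

A callback holding a map from access() can trigger expand() and keep
reading its old view; the view is only dropped on the final release().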
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Since we can have multiple openers, we should leave the mmap in place
for the other openers to use. Enhance the test to check the bug (it
still works, because without mmap we fall back to read/write, but
performance would be terrible!).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
-ECUTNPASTE. This is not a usage error!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This reduces test time from 31 seconds to 6, on my laptop.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Autobuild-User(master): Jelmer Vernooij <jelmer@samba.org>
Autobuild-Date(master): Thu Jun 21 19:59:57 CEST 2012 on sn-devel-104
|
|
Occasionally, the capability test inserts multiple used records and they
clash, but our primitive test layout engine doesn't handle hash clashes
and aborts.
Force a seed value which we know doesn't clash.
Reported-by: Andrew Bartlett <abartlet@samba.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Autobuild-User(master): Rusty Russell <rusty@rustcorp.com.au>
Autobuild-Date(master): Wed Jun 20 16:50:20 CEST 2012 on sn-devel-104
|
|
This is copied from tdb; we build the utilities, but as nothing else
links against it, we shouldn't be adding anything to the normal samba
binary sizes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Autobuild-User(master): Rusty Russell <rusty@rustcorp.com.au>
Autobuild-Date(master): Tue Jun 19 07:31:06 CEST 2012 on sn-devel-104
|
|
Update the design.lyx file with the latest status and the change in hashing.
Also, refresh and add examples to the TDB_porting.txt file.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
We access the key on lookup, then access the data in the caller. It
makes more sense to access both at once. We also put in a likely()
for the case where the hash is not chained.
Before:
Adding 1000 records: 3644-3724(3675) ns (129656 bytes)
Finding 1000 records: 1596-1696(1622) ns (129656 bytes)
Missing 1000 records: 1409-1525(1452) ns (129656 bytes)
Traversing 1000 records: 1636-1747(1668) ns (129656 bytes)
Deleting 1000 records: 3138-3223(3175) ns (129656 bytes)
Re-adding 1000 records: 3278-3414(3329) ns (129656 bytes)
Appending 1000 records: 5396-5529(5426) ns (253312 bytes)
Churning 1000 records: 9451-10095(9584) ns (253312 bytes)
smbtorture results (--entries=1000)
ntdb speed 183881-191112(188223) ops/sec
After:
Adding 1000 records: 3590-3701(3640) ns (129656 bytes)
Finding 1000 records: 1539-1605(1566) ns (129656 bytes)
Missing 1000 records: 1398-1440(1413) ns (129656 bytes)
Traversing 1000 records: 1629-2015(1710) ns (129656 bytes)
Deleting 1000 records: 3118-3236(3163) ns (129656 bytes)
Re-adding 1000 records: 3235-3355(3275) ns (129656 bytes)
Appending 1000 records: 5335-5444(5385) ns (253312 bytes)
Churning 1000 records: 9350-9955(9494) ns (253312 bytes)
smbtorture results (--entries=1000)
ntdb speed 180559-199981(195106) ops/sec
|
|
Since our default hashsize is 8192 not 131, we look fat when we convert
near-empty TDBs.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Just like tdbtorture, having a hashsize of 2 stresses us much more!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Since we've given up on expansion, let them frob the hashsize again.
We have attributes, so we should use them for optional stuff like
this.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
TDB2 started with a top-level hash of 1024 entries, divided into 128
groups of 8 buckets. When a bucket filled, the 8 bucket group
expanded into pointers into 8 new 64-entry hash tables. When these
filled, they expanded in turn, etc.
It's a nice idea to automatically expand the hash tables, but it
doesn't pay off. Remove it for NTDB.
1) It only beats TDB performance when the database is huge and the
TDB hashsize is small. We are about 20% slower on medium-size
databases (1000 to 10000 records), worse on really small ones.
2) Since we're 64 bits, our hash tables are already twice as expensive
as TDB.
3) Since our hash function is good, it means that all groups tend to
fill at the same time, meaning the hash enlarges by a factor of 128
all at once, leading to a very large database at that point.
4) Our efficiency would improve if we enlarged the top level, but
that makes our minimum db size even worse: it's already over 8k,
and jumps to 1M after about 1000 entries!
5) Making the sub group size larger gives a shallower tree, which
performs better, but makes the "hash explosion" problem worse.
6) The code is complicated, having to handle delete and reshuffling
groups of hash buckets, and expansion of buckets.
7) We have to handle the case where all the records somehow end up with
the same hash value, which requires special code to chain records for
that case.
On the other hand, it would be nice if we didn't degrade as badly as
TDB does when the hash chains get long.
This patch removes the hash-growing code, but instead of chaining like
TDB does when a bucket fills, we point the bucket to an array of
record pointers. Since each on-disk NTDB pointer contains some hash
bits from the record (we steal the upper 8 bits of the offset), 99.5%
of the time we don't need to load the record to determine if it
matches. This makes an array of offsets much more cache-friendly than
a linked list.
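
To make the "hash bits in the pointer" idea concrete, here is a small
sketch. It assumes a 64-bit offset whose top 8 bits carry a tag taken
from the record's hash; which hash bits are used, and the helper names
(encode_ptr, find, load_record), are assumptions for illustration and
not the on-disk format.

```python
OFFSET_BITS = 56
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def encode_ptr(offset, hash_value):
    """Pack 8 bits of the record's hash into the top byte of the offset."""
    assert 0 <= offset <= OFFSET_MASK
    return ((hash_value & 0xFF) << OFFSET_BITS) | offset


def decode_offset(ptr):
    return ptr & OFFSET_MASK


def stolen_bits(ptr):
    return ptr >> OFFSET_BITS


def find(bucket, hash_value, key, load_record):
    """Scan an array of encoded pointers, loading a record only when
    its stored hash bits match the hash we are looking for."""
    want = hash_value & 0xFF
    for ptr in bucket:
        if stolen_bits(ptr) != want:
            continue  # cheap reject: no need to touch the record
        rec_key, rec_data = load_record(decode_offset(ptr))
        if rec_key == key:
            return rec_data
    return None
```

With 8 tag bits, an unrelated record passes the cheap check by chance
only about once in 256 comparisons, so most misses are rejected from the
pointer array alone.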
Here are the times (in ns) for tdb_store of N records, tdb_store of N
records the second time, and a fetch of all N records. I've also
included the final database size and the smbtorture local.[n]tdb_speed
results.
Benchmark details:
1) Compiled with -O2.
2) assert() was disabled in TDB2 and NTDB.
3) The "optimize fetch" patch was applied to NTDB.
10 runs, using tmpfs (otherwise massive swapping as db hits ~30M,
despite plenty of RAM).
                  Insert  Re-ins   Fetch    Size    dbspeed
                  (nsec)  (nsec)  (nsec)    (Kb)  (ops/sec)
TDB (10000 hashsize):
    100 records:    3882    3320    1609      53     203204
   1000 records:    3651    3281    1571     115     218021
  10000 records:    3404    3326    1595     880     202874
 100000 records:    4317    3825    2097    8262     126811
1000000 records:   11568   11578    9320   77005      25046
TDB2 (1024 hashsize, expandable):
    100 records:    3867    3329    1699      17     187100
   1000 records:    4040    3249    1639     154     186255
  10000 records:    4143    3300    1695    1226     185110
 100000 records:    4481    3425    1800   17848     163483
1000000 records:    4055    3534    1878  106386     160774
NTDB (8192 hashsize):
    100 records:    4259    3376    1692      82     190852
   1000 records:    3640    3275    1566     130     195106
  10000 records:    4337    3438    1614     773     188362
 100000 records:    4750    5165    1746    9001     169197
1000000 records:    4897    5180    2341   83838     121901
Analysis:
1) TDB wins on small databases, beating TDB2 by ~15%, NTDB by ~10%.
2) TDB starts to lose when hash chains get 10 long (fetch 10% slower
than TDB2/NTDB).
3) TDB does horribly when hash chains get 100 long (fetch 4x slower
than NTDB, 5x slower than TDB2, insert about 2-3x slower).
4) TDB2 databases are 40% larger than TDB1. NTDB is about 15% larger
than TDB1.
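
The ratios in the analysis can be re-derived from the table. The sketch
below does so for the 1000000-record rows only; the numbers are copied
from the table above, and the dictionary layout and ratio() helper are
just an illustration.

```python
# 1000000-record rows: insert and fetch in nsec, size in Kb.
rows = {
    "TDB":  {"insert": 11568, "fetch": 9320, "size_kb": 77005},
    "TDB2": {"insert": 4055,  "fetch": 1878, "size_kb": 106386},
    "NTDB": {"insert": 4897,  "fetch": 2341, "size_kb": 83838},
}


def ratio(metric, a, b):
    """How many times larger `a` is than `b` for the given metric."""
    return rows[a][metric] / rows[b][metric]


for metric, a, b in [
    ("fetch", "TDB", "NTDB"),
    ("fetch", "TDB", "TDB2"),
    ("insert", "TDB", "NTDB"),
    ("insert", "TDB", "TDB2"),
    ("size_kb", "TDB2", "TDB"),
    ("size_kb", "NTDB", "TDB"),
]:
    print(f"{metric}: {a}/{b} = {ratio(metric, a, b):.2f}")
```

The size ratios differ from row to row, so the percentages quoted in
point 4 should be read as rough figures across the table rather than as
the value for any single row.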
|
|
We also split off the NTDB_CONVERT case (where the ntdb is of a
different endian) into its own io function.
NTDB speed:
Adding 10000 records: 3894-9951(8553) ns (815528 bytes)
Finding 10000 records: 1644-4294(3580) ns (815528 bytes)
Missing 10000 records: 1497-4018(3303) ns (815528 bytes)
Traversing 10000 records: 1585-4225(3505) ns (815528 bytes)
Deleting 10000 records: 3088-8154(6927) ns (815528 bytes)
Re-adding 10000 records: 3192-8308(7089) ns (815528 bytes)
Appending 10000 records: 5187-13307(11365) ns (1274312 bytes)
Churning 10000 records: 6772-17567(15078) ns (1274312 bytes)
NTDB speed in transaction:
Adding 10000 records: 1602-2404(2214) ns (815528 bytes)
Finding 10000 records: 456-871(778) ns (815528 bytes)
Missing 10000 records: 393-522(503) ns (815528 bytes)
Traversing 10000 records: 729-1015(945) ns (815528 bytes)
Deleting 10000 records: 1065-1476(1374) ns (815528 bytes)
Re-adding 10000 records: 1397-1930(1819) ns (815528 bytes)
Appending 10000 records: 2927-3351(3184) ns (1274312 bytes)
Churning 10000 records: 3921-4697(4378) ns (1274312 bytes)
smbtorture results:
ntdb speed 86581-191518(175666) ops/sec
Applying patch..increase-top-level.patch
|
|
The simple "is it in range" check can be inline; complex cases can be
handed through to the normal or transaction handler.
NTDB speed:
Adding 10000 records: 4111-9983(9149) ns (815528 bytes)
Finding 10000 records: 1667-4464(3810) ns (815528 bytes)
Missing 10000 records: 1511-3992(3546) ns (815528 bytes)
Traversing 10000 records: 1698-4254(3724) ns (815528 bytes)
Deleting 10000 records: 3608-7998(7358) ns (815528 bytes)
Re-adding 10000 records: 3259-8504(7805) ns (815528 bytes)
Appending 10000 records: 5393-13579(12356) ns (1274312 bytes)
Churning 10000 records: 6966-17813(16136) ns (1274312 bytes)
NTDB speed in transaction:
Adding 10000 records: 916-2230(2004) ns (815528 bytes)
Finding 10000 records: 330-866(770) ns (815528 bytes)
Missing 10000 records: 196-520(471) ns (815528 bytes)
Traversing 10000 records: 356-879(800) ns (815528 bytes)
Deleting 10000 records: 505-1267(1108) ns (815528 bytes)
Re-adding 10000 records: 658-1681(1477) ns (815528 bytes)
Appending 10000 records: 1088-2827(2498) ns (1274312 bytes)
Churning 10000 records: 1636-4267(3785) ns (1274312 bytes)
smbtorture results:
ntdb speed 85588-189430(157110) ops/sec
|
|
This is designed to allow us to make ntdb_context (and NTDB_DATA returned
from ntdb_fetch) a talloc pointer. But it can also be used for any other
alternate allocator.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
NTDB_NOSYNC now just prevents the fsync/msync calls, which speeds
testing while still providing full coverage. It also provides safety
against processes dying during transaction commit (though obviously,
not against the machine dying).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
TDB allows this for internal databases, but it's a bad idea, since the
name is useful for logging.
They're a hassle to deal with, and we'd just end up putting "unnamed"
in there, so let the user deal with it. If they don't, they get an
informative core dump.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
The performance numbers for transaction pagesize are indeterminate:
larger pagesizes mean a smaller transaction array, and a better
chance of having a contiguous record (more efficient for
ntdb_parse_record and some internal operations inside a transaction).
On the other hand, a large pagesize means more I/O even if we change
only a few bytes.
But it also controls the multiple by which we will enlarge the file,
and hence the minimum db size. It's 4k for tdb1, but 16k seems
reasonable in these modern times.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Now our database is always a multiple of NTDB_PGSIZE, we can remove the
special handling for the last block.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
As copied from tdb1, there is logic in the transaction code to handle
a non-PGSIZE multiple db, but in fact this only happens for a
completely unused database: as soon as we add anything to it, it is
expanded to a NTDB_PGSIZE multiple.
If we create the database with a free record which pads it out to
NTDB_PGSIZE, we can remove this last-page-is-different logic.
Of course, the fake ntdbs we create in our tests now also need to be
multiples of NTDB_PGSIZE, so we change some numbers there too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
ntdb uses tdb's transaction code, and it has an undocumented but implicit
assumption: that the transaction recovery area is always aligned to the
transaction pagesize. This means that no block will overlap the recovery
area.
This is maintained by rounding the size of the database up, so do the same
for ntdb.
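
The rounding itself is simple arithmetic. A minimal sketch, assuming a
16k transaction page size (the value mentioned elsewhere in this log);
PGSIZE, round_up and blocks_touched are hypothetical names.

```python
PGSIZE = 16384  # assumption: 16k transaction page size


def round_up(size, pgsize=PGSIZE):
    """Round size up to the next multiple of pgsize."""
    return (size + pgsize - 1) // pgsize * pgsize


def blocks_touched(offset, length, pgsize=PGSIZE):
    """Indices of the page-sized blocks covered by a byte range."""
    first = offset // pgsize
    last = (offset + length - 1) // pgsize
    return list(range(first, last + 1))
```

If the database size is always rounded up this way, anything appended
after it (such as a recovery area) starts on a block boundary, so no
block of database contents can straddle it.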
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
We were missing the last few bytes. Found by 100 runs of ntdbtorture
-t -k.
The transaction test code didn't catch this, because usually those
last few bytes are irrelevant to the actual contents of the database.
We fix the test.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Our external test helper is a bit primitive when it comes to doing STORE or
FETCH commands: let us specify the data we expect, instead of assuming it's
the same as the key.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This is a fairly common pattern in Samba, and if we log an error on
every open it spams the logs. On the other hand, other errors are
potentially more serious, so we still use NTDB_LOG_ERROR on them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
We document that the child of a fork() can do a brunlock() if the parent
does a brlock: we should not log an error when they do this.
Also, test the case where we fork() and return inside a parse function
(which is allowed).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
In tdb, we grab the open lock immediately after we open the file. In
ntdb, we usually did some work first. tdbtorture managed to get in
before the creator grabbed the lock:
testing with 3 processes, 5000 loops, seed=1338246020
ntdb:torture.ntdb:IO Error:ntdb_open: torture.ntdb is not a ntdb file
29023:torture.ntdb:db open failed
At the cost of a little duplicated code, we can reduce the race.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
Make --valgrind and --valgrind-log options work!
Amitay figured this out!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
We need --error-exitcode=, otherwise valgrind errors don't cause the
test to fail.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
It was a hack to make compatibility easier. Since we're not doing that,
it can go away: all callers must use the return value now.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|
|
This renames everything from tdb2 to ntdb: importantly, we no longer
use the tdb_ namespace, so you can link against both ntdb and tdb if
you want to.
This also enables building of standalone ntdb by the autobuild script.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
|