- 29 Apr, 2020 1 commit
-
-
Peter Dillinger authored
Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
-
- 25 Apr, 2020 1 commit
-
-
anand76 authored
Summary: False alarms are caused by codepaths that intentionally swallow IO errors. Tests: make crash_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/6741 Reviewed By: ltamasi Differential Revision: D21181138 Pulled By: anand1976 fbshipit-source-id: 5ccfbc68eb192033488de6269e59c00f2c65ce00
-
- 11 Apr, 2020 1 commit
-
-
anand76 authored
Summary: This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538 Test Plan: crash_test make check Reviewed By: riversand963 Differential Revision: D20714347 Pulled By: anand1976 fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
-
- 26 Feb, 2020 1 commit
-
-
Andrew Kryczka authored
Summary: Original author: jeffrey-xiao If we are writing a global seqno for an ingested file, the range tombstone metablock gets accessed and put into the cache during ingestion preparation. At the time, the global seqno of the ingested file has not yet been determined, so the cached block will not have a global seqno. When the file is ingested and we read its range tombstone metablock, it will be returned from the cache with no global seqno. In that case, we use the actual seqnos stored in the range tombstones, which are all zero, so the tombstones cover nothing. This commit removes global_seqno_ variable from Block. When iterating over a block, the global seqno for the block is determined by the iterator instead of storing this mutable attribute in Block. Additionally, this commit adds a regression test to check that keys are deleted when ingesting a file with a global seqno and range deletion tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429 Differential Revision: D19961563 Pulled By: ajkr fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
-
- 21 Feb, 2020 1 commit
-
-
sdong authored
Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
-
- 28 Jan, 2020 1 commit
-
-
Peter Dillinger authored
Summary: Remove the redundant PartitionedFilterBlockBuilder::num_added_ and ::NumAdded since the parent class, FullFilterBlockBuilder, already provides them. Also rename filters_in_partition_ and filters_per_partition_ to keys_added_to_partition_ and keys_per_partition_ to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6299 Test Plan: make check Differential Revision: D19413278 Pulled By: pdillinger fbshipit-source-id: 04926ee7874477d659cb2b6ae03f2d995fb747e5
-
- 19 Oct, 2019 1 commit
-
-
Levi Tamasi authored
Summary: Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that only the filter block contents are stored in the block cache (as opposed to the earlier design where the cache stored the filter block reader itself, leading to potentially dangling pointers and concurrency bugs). However, this change introduced a performance hit since with the new code, the metadata fields are re-parsed upon every access. This patch reunites the block contents with the filter bits reader to eliminate this overhead; since this is still a self-contained pure data object, it is safe to store it in the cache. (Note: this is similar to how the zstd digest is handled.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936 Test Plan: make asan_check filter_bench results for the old code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.7153 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.4258 Single filter ns/op: 42.5974 Random filter ns/op: 217.861 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.4217 Single filter ns/op: 50.9855 Random filter ns/op: 219.167 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5172 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 32.3556 Single filter ns/op: 83.2239 Random filter ns/op: 370.676 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.2265 Single filter ns/op: 93.5651 Random filter ns/op: 408.393 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` With the new code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 25.4285 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 31.0594 Single filter ns/op: 43.8974 Random filter ns/op: 226.075 ---------------------------- Outside queries... Dry run (25d) ns/op: 31.0295 Single filter ns/op: 50.3824 Random filter ns/op: 226.805 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5308 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.2968 Single filter ns/op: 58.6163 Random filter ns/op: 291.434 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.1839 Single filter ns/op: 66.9039 Random filter ns/op: 292.828 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` Differential Revision: D17991712 Pulled By: ltamasi fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
-
- 12 Oct, 2019 1 commit
-
-
Maysam Yabandeh authored
Summary: Partition Filters make use of a top-level index to find the partition that might have the bloom hash of the key. The index is with internal key format (before format version 3). Each partition contains the i) blooms of the keys in that range ii) bloom of prefixes of keys in that range, iii) the bloom of the prefix of the last key in the previous partition. When ::SeekForPrev(key), we first perform a prefix bloom test on the SST file. The partition however is identified using the full internal key, rather than the prefix key. The reason is to be compatible with the internal key format of the top-level index. This creates a corner case. Example: - SST k, Partition N: P1K1, P1K2 - SST k, top-level index: P1K2 - SST k+1, Partition 1: P2K1, P3K1 - SST k+1 top-level index: P3K1 When SeekForPrev(P1K3), it should point us to P1K2. However SST k top-level index would reject P1K3 since it is out of range. One possible fix would be to search with the prefix P1 (instead of full internal key P1K3) however the details of properly comparing prefix with full internal key might get complicated. The fix we apply in this PR is to look into the last partition anyway even if the key is out of range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5907 Differential Revision: D17889918 Pulled By: maysamyabandeh fbshipit-source-id: 169fd7b3c71dbc08808eae5a8340611ebe5bdc1e
-
- 02 Oct, 2019 1 commit
-
-
Yanqin Jin authored
Summary: Without this fix, compiler complains. ``` $ROCKSDB_NO_FBCODE=1 USE_CLANG=1 make ldb table/block_based/full_filter_block.cc: In constructor ‘rocksdb::FullFilterBlockBuilder::FullFilterBlockBuilder(const rocksdb::SliceTransform*, bool, rocksdb::FilterBitsBuilder*)’: table/block_based/full_filter_block.cc:20:43: error: declaration of ‘prefix_extractor’ shadows a member of 'this' [-Werror=shadow] FilterBitsBuilder* filter_bits_builder) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5872 Test Plan: ``` $ROCKSDB_NO_FBCODE=1 make all ``` Differential Revision: D17690058 Pulled By: riversand963 fbshipit-source-id: 19e3d9bd86e1123847095240e73d30da5d66240e
-
- 25 Sep, 2019 1 commit
-
-
Maysam Yabandeh authored
Summary: Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition. The key is guaranteed to be larger or equal than any key in that partition. When used with format_version 3, which excludes the sequence number form index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In this way, when searching for the key, the top-level index will lead us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition. The patch fixes that by a hack: It always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases that the index will lead us to the previous partition, we still can find the bloom filter there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835 Differential Revision: D17513585 Pulled By: maysamyabandeh fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9
-
- 17 Sep, 2019 1 commit
-
-
Maysam Yabandeh authored
Summary: For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor. This PR builds up on https://github.com/facebook/rocksdb/issues/4258 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797 Test Plan: - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata. - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata. Differential Revision: D17396833 Pulled By: maysamyabandeh fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
-
- 15 Aug, 2019 1 commit
-
-
Levi Tamasi authored
Fix regression affecting partitioned indexes/filters when cache_index_and_filter_blocks is false (#5705) Summary: PR https://github.com/facebook/rocksdb/issues/5298 (and subsequent related patches) unintentionally changed the semantics of cache_index_and_filter_blocks: historically, this option only affected the main index/filter block; with the changes, it affects index/filter partitions as well. This can cause performance issues when cache_index_and_filter_blocks is false since in this case, partitions are neither cached nor preloaded (i.e. they are loaded on demand upon each access). The patch reverts to the earlier behavior, that is, partitions are cached similarly to data blocks regardless of the value of the above option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5705 Test Plan: make check ./db_bench -benchmarks=fillrandom --statistics --stats_interval_seconds=1 --duration=30 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false ./db_bench -benchmarks=readrandom --use_existing_db --statistics --stats_interval_seconds=1 --duration=10 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false --cache_size=8000000000 Relevant statistics from the readrandom benchmark with the old code: rocksdb.block.cache.index.miss COUNT : 0 rocksdb.block.cache.index.hit COUNT : 0 rocksdb.block.cache.index.add COUNT : 0 rocksdb.block.cache.index.bytes.insert COUNT : 0 rocksdb.block.cache.index.bytes.evict COUNT : 0 rocksdb.block.cache.filter.miss COUNT : 0 rocksdb.block.cache.filter.hit COUNT : 0 rocksdb.block.cache.filter.add COUNT : 0 rocksdb.block.cache.filter.bytes.insert COUNT : 0 rocksdb.block.cache.filter.bytes.evict COUNT : 0 With the new code: rocksdb.block.cache.index.miss COUNT : 2500 rocksdb.block.cache.index.hit COUNT : 42696 rocksdb.block.cache.index.add COUNT : 2500 rocksdb.block.cache.index.bytes.insert COUNT : 4050048 rocksdb.block.cache.index.bytes.evict COUNT : 0 rocksdb.block.cache.filter.miss COUNT : 2500 rocksdb.block.cache.filter.hit COUNT : 4550493 rocksdb.block.cache.filter.add COUNT : 2500 rocksdb.block.cache.filter.bytes.insert COUNT : 10331040 rocksdb.block.cache.filter.bytes.evict COUNT : 0 Differential Revision: D16817382 Pulled By: ltamasi fbshipit-source-id: 28a516b0da1f041a03313e0b70b28cf5cf205d00
-
- 24 Jul, 2019 1 commit
-
-
Levi Tamasi authored
Summary: RocksDB has historically stored uncompression dictionary objects in the block cache as opposed to storing just the block contents. This neccesitated evicting the object upon table close. With the new code, only the raw blocks are stored in the cache, eliminating the need for eviction. In addition, the patch makes the following improvements: 1) Compression dictionary blocks are now prefetched/pinned similarly to index/filter blocks. 2) A copy operation got eliminated when the uncompression dictionary is retrieved. 3) Errors related to retrieving the uncompression dictionary are propagated as opposed to silently ignored. Note: the patch temporarily breaks the compression dictionary evicition stats. They will be fixed in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584 Test Plan: make asan_check Differential Revision: D16344151 Pulled By: ltamasi fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
-
- 17 Jul, 2019 1 commit
-
-
Levi Tamasi authored
Summary: Currently, when the block cache is used for the filter block, it is not really the block itself that is stored in the cache but a FilterBlockReader object. Since this object is not pure data (it has, for instance, pointers that might dangle, including in one case a back pointer to the TableReader), it's not really sharable. To avoid the issues around this, the current code erases the cache entries when the TableReader is closed (which, BTW, is not sufficient since a concurrent TableReader might have picked up the object in the meantime). Instead of doing this, the patch moves the FilterBlockReader out of the cache altogether, and decouples the filter reader object from the filter block. In particular, instead of the TableReader owning, or caching/pinning the FilterBlockReader (based on the customer's settings), with the change the TableReader unconditionally owns the FilterBlockReader, which in turn owns/caches/pins the filter block. This change also enables us to reuse the code paths historically used for data blocks for filters as well. Note: Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a separate phase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504 Test Plan: make asan_check Differential Revision: D16036974 Pulled By: ltamasi fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
-
- 25 Jun, 2019 1 commit
-
-
Mike Kolupaev authored
Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
-
- 21 Jun, 2019 1 commit
-
-
haoyuhuang authored
Summary: This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block. 1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name. 2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable. This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454 Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32. Differential Revision: D15819451 Pulled By: HaoyuHuang fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
-
- 11 Jun, 2019 1 commit
-
-
haoyuhuang authored
Summary: BlockCacheLookupContext only contains the caller for now. We will trace block accesses at five places: 1. BlockBasedTable::GetFilter. 2. BlockBasedTable::GetUncompressedDict. 3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.) 4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.) 5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.) We create the context at: 1. BlockBasedTable::Get. (kUserGet) 2. BlockBasedTable::MultiGet. (kUserMGet) 3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.) 4. BlockBasedTable::Open. (kPrefetch) 5. Index/Filter::CacheDependencies. (kPrefetch) 6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize). I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable. Throughput of this PR: 231334 ops/s. Throughput of the master branch: 238428 ops/s. Experiment setup: RocksDB: version 6.2 Date: Mon Jun 10 10:42:51 2019 CPU: 24 * Intel Core Processor (Skylake) CPUCache: 16384 KB Keys: 20 bytes each Values: 100 bytes each (100 bytes after compression) Entries: 1000000 Prefix: 20 bytes Keys per prefix: 0 RawSize: 114.4 MB (estimated) FileSize: 114.4 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: NoCompression Compression sampling rate: 0 Memtablerep: skip_list Perf Level: 1 Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120 TODOs: 1. Create a caller for external SST file ingestion and differentiate the callers for iterator. 2. Integrate tracer to trace block cache accesses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421 Differential Revision: D15704258 Pulled By: HaoyuHuang fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
-
- 07 Jun, 2019 1 commit
-
-
Levi Tamasi authored
Summary: The patch cleans up the handling of cache hit/miss/insertion related performance counters, get context counters, and statistics by eliminating some code duplication and factoring out the affected logic into separate methods. In addition, it makes the semantics of cache hit metrics more consistent by changing the code so that accessing a partition of partitioned indexes/filters through a pinned reference no longer counts as a cache hit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408 Differential Revision: D15610883 Pulled By: ltamasi fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
-
- 31 May, 2019 1 commit
-
-
Vijay Nadimpalli authored
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
-
- 11 May, 2019 1 commit
-
-
Levi Tamasi authored
Summary: CachableEntry is used in a variety of contexts: it may refer to a cached object (i.e. an object in the block cache), an owned object, or an unowned object; also, in some cases (most notably with iterators), the responsibility of managing the pointed-to object gets handed off to another object. Each of the above scenarios have different implications for the lifecycle of the referenced object. For the most part, the patch does not change the lifecycle of managed objects; however, it makes these relationships explicit, and it also enables us to eliminate some hacks and accident-prone code around releasing cache handles and deleting/cleaning up objects. (The only places where the patch changes how an objects are managed are the partitions of partitioned indexes and filters.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252 Differential Revision: D15101358 Pulled By: ltamasi fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be
-
- 18 Sep, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381 Differential Revision: D9879505 Pulled By: maysamyabandeh fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
-
- 10 Aug, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
-
- 13 Jul, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121 Differential Revision: D8816832 Pulled By: maysamyabandeh fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
-
- 29 Jun, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it. Closes https://github.com/facebook/rocksdb/pull/4073 Differential Revision: D8686552 Pulled By: maysamyabandeh fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
-
- 28 Jun, 2018 1 commit
-
-
Zhongyi Xie authored
Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
-
- 20 Jun, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut. This bug could result into false negatives, which means that DB could skip an existing key. Closes https://github.com/facebook/rocksdb/pull/4024 Differential Revision: D8518866 Pulled By: maysamyabandeh fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
-
- 07 Jun, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
-
- 26 May, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
-
- 22 May, 2018 1 commit
-
-
Zhongyi Xie authored
Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
-
- 13 Apr, 2018 1 commit
-
-
David Lai authored
Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
-
- 10 Apr, 2018 1 commit
-
-
Maysam Yabandeh authored
Summary: The existing unit test did not set the level so the check for pinned partitioned filter/index being properly released from the block cache was not properly exercised as they only take effect in level 0. As a result a memory leak in pinned partitioned filters was hidden. The patch fix the test as well as the bug. Closes https://github.com/facebook/rocksdb/pull/3692 Differential Revision: D7559763 Pulled By: maysamyabandeh fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a
-
- 22 Mar, 2018 1 commit
-
-
Huachao Huang authored
Summary: Close https://github.com/facebook/rocksdb/issues/3592 Closes https://github.com/facebook/rocksdb/pull/3614 Differential Revision: D7291706 Pulled By: ajkr fbshipit-source-id: 9dd8f40bd7716588e1e3fd6be0c2bc2766861f8c
-
- 07 Mar, 2018 1 commit
-
-
Fosco Marotto authored
Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
-
- 06 Mar, 2018 1 commit
-
-
Andrew Kryczka authored
Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e
-
- 23 Feb, 2018 2 commits
-
-
Igor Sugak authored
Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37
-
David Lai authored
Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461
-
- 13 Dec, 2017 1 commit
-
-
Zhongyi Xie authored
Summary: This PR addresses the following heavy hitters in `Get` operation by moving calls to `StatisticsImpl::recordTick` from `BlockBasedTable` to `Version::Get` - rocksdb.block.cache.bytes.write - rocksdb.block.cache.add - rocksdb.block.cache.data.miss - rocksdb.block.cache.data.bytes.insert - rocksdb.block.cache.data.add - rocksdb.block.cache.hit - rocksdb.block.cache.data.hit - rocksdb.block.cache.bytes.read The db_bench statistics before and after the change are: |1GB block read|Children |Self |Command |Shared Object |Symbol| |---|---|---|---|---|---| |master: |4.22% |1.31% |db_bench |db_bench |[.] rocksdb::StatisticsImpl::recordTick| |updated: |0.51% |0.21% |db_bench |db_bench |[.] rocksdb::StatisticsImpl::recordTick| | |0.14% |0.14% |db_bench |db_bench |[.] rocksdb::GetContext::record_counters| |1MB block read|Children |Self |Command |Shared Object |Symbol| |---|---|---|---|---|---| |master: |3.48% |1.08% |db_bench |db_bench |[.] rocksdb::StatisticsImpl::recordTick| |updated: |0.80% |0.31% |db_bench |db_bench |[.] rocksdb::StatisticsImpl::recordTick| | |0.35% |0.35% |db_bench |db_bench |[.] rocksdb::GetContext::record_counters| Closes https://github.com/facebook/rocksdb/pull/3172 Differential Revision: D6330532 Pulled By: miasantreble fbshipit-source-id: 2b492959e00a3db29e9437ecdcc5e48ca4ec5741
-
- 23 Aug, 2017 1 commit
-
-
Maysam Yabandeh authored
Summary: This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there. Closes https://github.com/facebook/rocksdb/pull/2766 Differential Revision: D5671098 Pulled By: maysamyabandeh fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115
-
- 12 Aug, 2017 1 commit
-
-
Siying Dong authored
Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
-
- 22 Jul, 2017 1 commit
-
-
Sagar Vemuri authored
Summary: This reverts the previous commit 1d7048c5, which broke the build. Did a `git revert 1d7048c5`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
-