1. 06 Jun, 2020 1 commit
  2. 28 May, 2020 1 commit
  3. 12 May, 2020 1 commit
    • Improve ldb consistency checks (#6802) · 8f9cc109
      sdong authored
      When using ldb, users cannot turn on force_consistency_checks in most commands, and they cannot use checkconsistency with --try_load_options. The change fixes both by:
      1. checkconsistency now calls OpenDB() so that it gets all the option-loading and option-sanitizing logic
      2. use options.force_consistency_checks = true by default, and add a --disable_consistency_checks flag to turn it off.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802
      Test Plan: Add a new unit test. Some manual tests with corrupted DBs.
      Reviewed By: pdillinger
      Differential Revision: D21388051
      fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
  4. 09 May, 2020 2 commits
  5. 01 May, 2020 3 commits
    • Make users explicitly be aware of prepare before commit (#6775) · ef0c3eda
      Cheng Chang authored
      In the current commit protocol of pessimistic transactions, if the transaction is not prepared before commit, the commit protocol implicitly assumes that the user wants to commit without prepare.
      This PR adds TransactionOptions::skip_prepare. The default value is `true`, because if it defaulted to `false`, all existing users who commit without prepare would need to update their code to set skip_prepare to true. Although this does not force users to state their intention to skip prepare explicitly, it at least makes them aware of the assumption that a transaction can commit without prepare.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6775
      Test Plan: added a new unit test TransactionTest::CommitWithoutPrepare
      Reviewed By: lth
      Differential Revision: D21313270
      Pulled By: cheng-chang
      fbshipit-source-id: 3d95b7c9b2d6cdddc09bdd66c561bc4fae8c3251
    • Flag CompressionOptions::parallel_threads to be experimental (#6781) · 6277e280
      sdong authored
      The CompressionOptions::parallel_threads feature is not yet mature. Mark it as experimental in the comments for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6781
      Reviewed By: pdillinger
      Differential Revision: D21330678
      fbshipit-source-id: d7dd7d099fb002a5c6a5d8da689ce5ee08a9eb13
    • Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      anand76 authored
      Calculate `IOOptions::timeout` using `ReadOptions::deadline` and pass it to `FileSystem::Read/FileSystem::MultiRead`. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support them, check the deadline in `RandomAccessFileReader::Read` and `MultiRead` and return `Status::TimedOut()` if it is exceeded.
      For now, TableReader creation, which might do file opens and reads, is not covered. It will be implemented in another PR.
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      Reviewed By: riversand963
      Differential Revision: D21285631
      Pulled By: anand1976
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
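A minimal sketch of the deadline-to-timeout conversion described in the commit above, using only the standard library. The helper name `ComputeIoTimeout` and the convention that a zero value means "no deadline" are assumptions made for illustration; they are not taken from the RocksDB source.

```cpp
#include <chrono>

using Micros = std::chrono::microseconds;

// Hypothetical helper (not RocksDB API). `deadline` and `now` are absolute
// times measured from the same clock; a zero deadline means "no deadline".
// Returns false if the deadline has already passed, in which case the caller
// would report a timed-out status. Otherwise stores the remaining budget in
// *timeout, using zero to mean "unbounded".
inline bool ComputeIoTimeout(Micros deadline, Micros now, Micros* timeout) {
  if (deadline == Micros::zero()) {
    *timeout = Micros::zero();
    return true;
  }
  if (now >= deadline) {
    return false;
  }
  *timeout = deadline - now;
  return true;
}
```

A reader that cannot enforce timeouts itself could still call such a helper before and after each read and fail the request once it returns false.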
  6. 29 Apr, 2020 2 commits
    • Add Functions to OptionTypeInfo (#6422) · 618bf638
      mrambacher authored
      Added functions for parsing, serializing, and comparing elements to OptionTypeInfo.  These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map.  Using these functions, every type can be handled via the map rather than special cased.
      By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422
      Test Plan: pass make check
      Reviewed By: siying
      Differential Revision: D21269005
      Pulled By: zhichao-cao
      fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
    • Clarifying comments in db.h (#6768) · b810e62b
      Peter Dillinger authored
      And fix a confusingly worded log message
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6768
      Reviewed By: anand1976
      Differential Revision: D21284527
      Pulled By: pdillinger
      fbshipit-source-id: f03c1422c229a901c3a65e524740452349626164
  7. 28 Apr, 2020 2 commits
    • Fixed minor typo in comment for MergeOperator::FullMergeV2() (#6759) · cc8d16ef
      Albert Hse-Lin Chen authored
      Fixed minor typo in comment for FullMergeV2().
      Last operand up to snapshot should be +4 instead of +3.
      Signed-off-by: Albert Hse-Lin Chen <hselin@kalista.io>
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6759
      Reviewed By: cheng-chang
      Differential Revision: D21260295
      Pulled By: zhichao-cao
      fbshipit-source-id: cc942306f246c8606538feb30bfdf6df9fb6c54e
    • Stats for redundant insertions into block cache (#6681) · 249eff0f
      Peter Dillinger authored
      Since read threads do not coordinate on loading data into block
      cache, two threads between Lookup and Insert can end up loading and
      inserting the same data. This is particularly concerning with
      cache_index_and_filter_blocks since those are hot and more likely to
      be race targets if ejected from (or not pre-populated in) the cache.
      Particularly with moves toward disaggregated / network storage, the cost
      of redundant retrieval might be high, and we should at least have some
      hard statistics from which we can estimate impact.
      Example with full filter thrashing "cliff":
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((130 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 14181
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 476
          rocksdb.block.cache.data.add COUNT : 12749
          rocksdb.block.cache.data.add.redundant COUNT : 18
          rocksdb.block.cache.filter.add COUNT : 1003
          rocksdb.block.cache.filter.add.redundant COUNT : 217
          rocksdb.block.cache.index.add COUNT : 429
          rocksdb.block.cache.index.add.redundant COUNT : 241
          $ ./db_bench --db=/tmp/rocksdbtest-172704/dbbench --use_existing_db --benchmarks=readrandom,stats --num=200000 --cache_index_and_filter_blocks --cache_size=$((120 * 1024 * 1024)) --bloom_bits=10 --threads=16 -statistics 2>&1 | egrep '^rocksdb.block.cache.(.*add|.*redundant)' | grep -v compress | sort
          rocksdb.block.cache.add COUNT : 1182223
          rocksdb.block.cache.add.failures COUNT : 0
          rocksdb.block.cache.add.redundant COUNT : 302728
          rocksdb.block.cache.data.add COUNT : 31425
          rocksdb.block.cache.data.add.redundant COUNT : 12
          rocksdb.block.cache.filter.add COUNT : 795455
          rocksdb.block.cache.filter.add.redundant COUNT : 130238
          rocksdb.block.cache.index.add COUNT : 355343
          rocksdb.block.cache.index.add.redundant COUNT : 172478
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6681
      Test Plan: Some manual testing (above) and unit test covering key metrics is included
      Reviewed By: ltamasi
      Differential Revision: D21134113
      Pulled By: pdillinger
      fbshipit-source-id: c11497b5f00f4ffdfe919823904e52d0a1a91d87
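To illustrate what a "redundant" insertion means in the commit above, here is a toy cache that counts insertions whose key is already present. `CountingCache` is a hypothetical stand-in, not RocksDB's `Cache` interface, and it assumes redundant adds are counted as a subset of all adds, which matches the totals in the sample output above.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical toy cache used only to show the counting rule.
struct CountingCache {
  std::unordered_map<std::string, std::string> entries;
  std::uint64_t add = 0;            // all insertions
  std::uint64_t add_redundant = 0;  // insertions whose key was already cached

  const std::string* Lookup(const std::string& key) const {
    auto it = entries.find(key);
    return it == entries.end() ? nullptr : &it->second;
  }

  void Insert(const std::string& key, std::string value) {
    ++add;
    if (entries.count(key) != 0) {
      // Another reader loaded and inserted the same block between our
      // Lookup miss and this Insert.
      ++add_redundant;
    }
    entries[key] = std::move(value);
  }
};
```

Two threads that both miss on `Lookup` and then both call `Insert` for the same key would leave `add == 2` and `add_redundant == 1`.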
  8. 22 Apr, 2020 2 commits
    • Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389) · 4cbc19d2
      mrambacher authored
      The methods in convenience.h are used to compare/convert objects to/from strings.  There is a mishmash of parameters in use here with more needed in the future.  This PR replaces those parameters with a single structure.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
      Reviewed By: siying
      Differential Revision: D21163707
      Pulled By: zhichao-cao
      fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
    • Implement deadline support for MultiGet (#6710) · c1ccd6b6
      anand76 authored
      Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This
      implementation enforces the deadline in DBImpl, which is fairly high
      level. It's best effort and may not check the deadline after every key
      lookup, but may do so after a batch of keys.
      In subsequent stages, we will extend this to passing a timeout down to the FileSystem.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710
      Test Plan: Add new unit tests
      Reviewed By: riversand963
      Differential Revision: D21149158
      Pulled By: anand1976
      fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e
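The best-effort, per-batch deadline check described in the commit above can be sketched as follows. `MultiGetWithDeadline`, `KeyStatus`, and the injected `now` and `lookup` callbacks are all hypothetical names chosen for this illustration; the real enforcement lives inside DBImpl and is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

enum class KeyStatus { kOk, kTimedOut };

// Hypothetical sketch. `deadline` is an absolute time in the units returned
// by `now`; zero means "no deadline". The deadline is checked once per batch,
// so a batch that has started always runs to completion.
inline std::vector<KeyStatus> MultiGetWithDeadline(
    const std::vector<std::string>& keys, long long deadline,
    std::size_t batch_size, const std::function<long long()>& now,
    const std::function<void(const std::string&)>& lookup) {
  if (batch_size == 0) batch_size = 1;
  // Keys start out as timed out and are flipped to OK once looked up.
  std::vector<KeyStatus> statuses(keys.size(), KeyStatus::kTimedOut);
  std::size_t i = 0;
  while (i < keys.size()) {
    if (deadline != 0 && now() >= deadline) break;
    const std::size_t end = std::min(keys.size(), i + batch_size);
    for (; i < end; ++i) {
      lookup(keys[i]);
      statuses[i] = KeyStatus::kOk;
    }
  }
  return statuses;
}
```

Because the check happens only between batches, a request can overshoot its deadline by up to one batch, which is the trade-off the commit message calls best effort.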
  9. 21 Apr, 2020 1 commit
    • C++20 compatibility (#6697) · 31da5e34
      Peter Dillinger authored
      Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:
      * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
      * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
      * std::random_shuffle deprecated in C++14 and removed in C++17 -> migrated to a replacement in RocksDB random.h API
      * Add the ability to build with different std version through -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
      * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
      * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
      * Added GCC 9 C++11 & GCC9 C++20 in Travis
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
      Test Plan: make check and CI
      Reviewed By: cheng-chang
      Differential Revision: D21020318
      Pulled By: pdillinger
      fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
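Two of the changes listed in the commit above can be illustrated with standard C++ alone. The class `Counter` and the function `ShuffledCopy` are hypothetical examples. Note that the commit itself migrated to a helper in RocksDB's random.h; `std::shuffle` is used here only as the standard-library alternative to `std::random_shuffle`.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

class Counter {
 public:
  // Capture `this` explicitly. Relying on [=] to capture `this` implicitly is
  // deprecated in C++20, and [=, this] is not valid before C++20.
  std::function<int()> MakeIncrementer() {
    return [this]() { return ++count_; };
  }
  int count() const { return count_; }

 private:
  int count_ = 0;
};

// std::shuffle takes an explicit random engine, unlike std::random_shuffle.
inline std::vector<int> ShuffledCopy(std::vector<int> v, unsigned seed) {
  std::mt19937 rng(seed);
  std::shuffle(v.begin(), v.end(), rng);
  return v;
}
```

Both snippets compile unchanged under C++11 through C++20, which is the property the explicit capture list is meant to preserve.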
  10. 18 Apr, 2020 1 commit
    • Add IsDirectory() to Env and FS (#6711) · 243852ec
      Yanqin Jin authored
      IsDirectory() is a common API to check whether a path is a regular file or a directory.
      POSIX: call stat() and use S_ISDIR(st_mode)
      Windows: PathIsDirectoryA() and PathIsDirectoryW()
      HDFS: FileSystem.IsDirectory()
      Java: File.isDirectory()
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711
      Test Plan: make check
      Reviewed By: anand1976
      Differential Revision: D21053520
      Pulled By: riversand963
      fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
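The POSIX variant mentioned in the commit above (`stat()` plus `S_ISDIR`) looks roughly like this. The function name `IsDirectoryPosix` is hypothetical, and unlike a real Env or FileSystem method this sketch collapses every error into `false` instead of returning a status.

```cpp
#include <sys/stat.h>

#include <string>

// Returns true only if `path` exists and refers to a directory.
// Any stat() failure (missing path, permission error, ...) yields false.
inline bool IsDirectoryPosix(const std::string& path) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0) {
    return false;
  }
  return S_ISDIR(st.st_mode);
}
```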
  11. 16 Apr, 2020 1 commit
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev authored
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      Reviewed By: siying
      Differential Revision: D20786930
      Pulled By: al13n321
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
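The `PrepareValue()` contract described in the commit above can be sketched with a toy single-entry iterator. `DeferredValueIter` and its `loader` callback are hypothetical and stand in for the real InternalIterator and the data-block read; only the calling convention (prepare first, then read the value) follows the commit message.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical toy iterator positioned on one entry.
class DeferredValueIter {
 public:
  // `loader` stands in for reading the data block; it returns false on error.
  DeferredValueIter(std::string key,
                    std::function<bool(std::string*)> loader,
                    bool allow_unprepared_value)
      : key_(std::move(key)), loader_(std::move(loader)) {
    // Callers that did not opt in get the value loaded eagerly.
    if (!allow_unprepared_value) PrepareValue();
  }

  const std::string& key() const { return key_; }

  // Loads the value at most once. Returns false if loading failed, which
  // gives callers a place to observe the error before touching value().
  bool PrepareValue() {
    if (!prepared_) {
      prepared_ = true;
      ok_ = loader_(&value_);
    }
    return ok_;
  }

  // Only meaningful after PrepareValue() has returned true.
  const std::string& value() const { return value_; }

 private:
  std::string key_;
  std::function<bool(std::string*)> loader_;
  std::string value_;
  bool prepared_ = false;
  bool ok_ = true;
};
```

A caller that only needs keys never triggers the load, while a caller that needs the value checks the result of `PrepareValue()` and handles the failure there.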
  12. 14 Apr, 2020 2 commits
  13. 11 Apr, 2020 2 commits
    • explicitly mark backup interfaces non-extensible (#6654) · f08630b9
      Andrew Kryczka authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6654
      Reviewed By: cheng-chang
      Differential Revision: D20878094
      Pulled By: ajkr
      fbshipit-source-id: 94d2561bdb6ffb7fe3773ca07d475337600a5b57
    • make iterator return versions between timestamp bounds (#6544) · 9e89ffb7
      Huisheng Liu authored
      (Based on Yanqin's idea) Add a new field in ReadOptions as the lower timestamp bound for the iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serve as a bounded time window, and the iterator returns all versions of a record that fall in the window.
      SeekRandom perf test (10 minutes) on the same development machine RAM drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
      baseline (commit e860f884):
      seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
      This PR:
      seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)
      db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544
      Reviewed By: ltamasi
      Differential Revision: D20844069
      Pulled By: riversand963
      fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
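The window semantics described in the commit above can be sketched for a single record as follows. `Version` and `VisibleVersions` are hypothetical names, plain integers stand in for user timestamps, and treating both bounds as inclusive is an assumption of this sketch rather than something the commit message states.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct Version {
  std::uint64_t ts;
  std::string value;
};

// `versions` holds all versions of one key, newest first.
// With lower_ts == nullptr, only the latest version with ts <= upper_ts is
// returned. Otherwise every version with *lower_ts <= ts <= upper_ts is
// returned (bounds assumed inclusive for this sketch).
inline std::vector<Version> VisibleVersions(
    const std::vector<Version>& versions, std::uint64_t upper_ts,
    const std::uint64_t* lower_ts) {
  std::vector<Version> out;
  for (const Version& v : versions) {
    if (v.ts > upper_ts) continue;  // newer than the read timestamp
    if (lower_ts == nullptr) {
      out.push_back(v);  // latest visible version only
      break;
    }
    if (v.ts < *lower_ts) break;  // older than the window
    out.push_back(v);
  }
  return out;
}
```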
  14. 08 Apr, 2020 1 commit
  15. 02 Apr, 2020 2 commits
    • Add counter in perf_context to time cipher time (#6596) · 2b02ea25
      Yi Wu authored
      Add `encrypt_data_time` and `decrypt_data_time` perf_context counters to time encryption/decryption time when `EnvEncryption` is enabled.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6596
      Test Plan: CI
      Reviewed By: anand1976
      Differential Revision: D20678617
      fbshipit-source-id: 7b57536143aa38509cde011f704de33382169e07
    • Add pipelined & parallel compression optimization (#6262) · 03a781a9
      Ziyue Yang authored
      This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
      Reviewed By: ajkr
      Differential Revision: D20651306
      fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
  16. 01 Apr, 2020 1 commit
  17. 31 Mar, 2020 2 commits
  18. 30 Mar, 2020 1 commit
    • Use FileChecksumGenFactory for SST file checksum (#6600) · e8d332d9
      Zhichao Cao authored
      In the current implementation, the SST file checksum is calculated by a shared checksum function object, which makes some checksum functions, such as SHA1, hard to apply. In this implementation, each SST file has its own checksum generator object, created by FileChecksumGenFactory. Users need to implement their own FileChecksumGenerator and factory to plug in a checksum calculation method.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
      Test Plan: tested with make asan_check
      Reviewed By: riversand963
      Differential Revision: D20717670
      Pulled By: zhichao-cao
      fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
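The reason for a factory, as described in the commit above, is that a stateful checksum needs one generator object per file. The following sketch shows that shape with hypothetical interfaces (`ChecksumGenerator`, `ChecksumGeneratorFactory`) and a toy byte-sum checksum; these are not the RocksDB class definitions.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical per-file generator: accumulates state across Update() calls.
class ChecksumGenerator {
 public:
  virtual ~ChecksumGenerator() = default;
  virtual void Update(const char* data, std::size_t n) = 0;
  virtual std::string Finalize() = 0;
};

// Hypothetical factory: hands out a fresh generator for each file.
class ChecksumGeneratorFactory {
 public:
  virtual ~ChecksumGeneratorFactory() = default;
  virtual std::unique_ptr<ChecksumGenerator> Create() = 0;
};

// Toy checksum (sum of byte values), used only to make the state visible.
class SumGenerator : public ChecksumGenerator {
 public:
  void Update(const char* data, std::size_t n) override {
    for (std::size_t i = 0; i < n; ++i) {
      sum_ += static_cast<unsigned char>(data[i]);
    }
  }
  std::string Finalize() override { return std::to_string(sum_); }

 private:
  std::uint32_t sum_ = 0;
};

class SumFactory : public ChecksumGeneratorFactory {
 public:
  std::unique_ptr<ChecksumGenerator> Create() override {
    return std::unique_ptr<ChecksumGenerator>(new SumGenerator());
  }
};
```

Because each file gets its own generator, two files written concurrently cannot corrupt each other's running state, which a single shared function object could not guarantee for incremental hashes.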
  19. 29 Mar, 2020 1 commit
    • Be able to decrease background thread's CPU priority when creating database backup (#6602) · ee50b8d4
      Cheng Chang authored
      When creating a database backup, the background threads not only consume IO resources by copying files, but also consume CPU, for example by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
      This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
      Test Plan: make check
      Reviewed By: siying, zhichao-cao
      Differential Revision: D20683216
      Pulled By: cheng-chang
      fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
  20. 28 Mar, 2020 1 commit
    • Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487) · 42468881
      Zhichao Cao authored
      In the current code base, we use Status to get and store the returned status from a call. For IO-related functions specifically, Status cannot reflect IO error details such as the error scope and whether the error is retryable. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is discarded at the lower level of the write path and converted to Status.
      The first job of this PR is to pass the IOStatus up through the write path (flush, WAL write, and compaction). The second job is to classify a retryable IO error as a HardError and set bg_error_ to HardError. In this case, the DB instance becomes read-only. The user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
      Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
      Reviewed By: anand1976
      Differential Revision: D20685017
      Pulled By: zhichao-cao
      fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
  21. 27 Mar, 2020 1 commit
    • Use function objects as deleters in the block cache (#6545) · 6301dbe7
      Levi Tamasi authored
      As the first step of reintroducing eviction statistics for the block
      cache, the patch switches from using simple function pointers as deleters
      to function objects implementing an interface. This will enable using
      deleters that have state, like a smart pointer to the statistics object
      that is to be updated when an entry is removed from the cache. For now,
      the patch adds a deleter template class `SimpleDeleter`, which simply
      casts the `value` pointer to its original type and calls `delete` or
      `delete[]` on it as appropriate. Note: to prevent object lifecycle
      issues, deleters must outlive the cache entries referring to them;
      `SimpleDeleter` ensures this by using the ("leaky") Meyers singleton pattern.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6545
      Test Plan: `make asan_check`
      Reviewed By: siying
      Differential Revision: D20475823
      Pulled By: ltamasi
      fbshipit-source-id: fe354c33dd96d9bafc094605462352305449a22a
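A minimal sketch of the deleter-as-function-object idea from the commit above. The interface name `Deleter` and the exact shape of `SimpleDeleter` are assumptions made for illustration, and the `delete[]` variant mentioned in the commit message is omitted. The singleton is deliberately never destroyed, so it outlives any cache entry that refers to it.

```cpp
// Hypothetical deleter interface: a function object instead of a raw
// function pointer, so that implementations may carry state.
class Deleter {
 public:
  virtual ~Deleter() = default;
  virtual void operator()(void* value) = 0;
};

// Stateless deleter for objects of type T allocated with `new`.
template <typename T>
class SimpleDeleter : public Deleter {
 public:
  // "Leaky" Meyers singleton: the instance is heap-allocated on first use
  // and intentionally never freed.
  static SimpleDeleter* GetInstance() {
    static SimpleDeleter* instance = new SimpleDeleter;
    return instance;
  }

  void operator()(void* value) override { delete static_cast<T*>(value); }

 private:
  SimpleDeleter() = default;
};
```

A stateful deleter, for example one holding a pointer to a statistics object, could implement the same interface, which is the extension the commit message anticipates.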
  22. 25 Mar, 2020 2 commits
    • Update default BBTO::format_version from 2 to 4 (#6582) · 93b80ca7
      Peter Dillinger authored
      Version 4 has been around long enough, for compatibility and
      extensive validation, that it should be default.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582
      Test Plan:
      CI (w.r.t. changing the default; format_version=4 is well
      tested and massively in production at Facebook)
      Reviewed By: siying
      Differential Revision: D20625233
      Pulled By: pdillinger
      fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
    • multiget support for timestamps (#6483) · a6ce5c82
      Huisheng Liu authored
      Add timestamp support for MultiGet().
      The timestamp from ReadOptions is honored, and timestamps can be returned along with values.
      MultiReadRandom perf test (10 minutes) on the same development machine RAM drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
      baseline (commit 17bef7d3):
        multireadrandom :     104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
      This PR:
        multireadrandom :     104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)
      .\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483
      Reviewed By: anand1976
      Differential Revision: D20498373
      Pulled By: riversand963
      fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
  23. 24 Mar, 2020 1 commit
    • Simplify migration to FileSystem API (#6552) · a9d168cf
      anand76 authored
      The current Env/FileSystem API separation has a couple of issues -
      1. It requires the user to specify 2 options - `Options::env` and `Options::file_system` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is `PosixEnv` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the `PosixEnv` implementation rather than the file_system implementation.
      2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
      This PR solves the above issues and simplifies the migration in the following ways -
      1. Embed a shared_ptr to the `FileSystem` in the `Env`, and remove `Options::file_system` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a `LegacyFileSystemWrapper` as the embedded `FileSystem`.
      1a. - This also makes it more robust by ensuring that even if RocksDB
        has some stray calls to Env APIs rather than FileSystem, they will go
        through the same object and thus there is no risk of getting out of
        sync.
      2. Provide a `NewCompositeEnv()` API that can be used to construct a
      PosixEnv with a custom FileSystem implementation. This eliminates an
      indirection to call Env APIs, and relieves the FileSystem developer of
      the burden of having to implement wrappers for the Env APIs.
      3. Add a couple of missing FileSystem APIs - `SanitizeEnvOptions()` and
      1. New unit tests
      2. make check and make asan_check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
      Reviewed By: riversand963
      Differential Revision: D20592038
      Pulled By: anand1976
      fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
  24. 21 Mar, 2020 1 commit
    • Attempt to recover from db with missing table files (#6334) · fb09ef05
      Yanqin Jin authored
      There are situations when RocksDB tries to recover, but the db is in an inconsistent state because SST files referenced in the MANIFEST are missing. In this case, RocksDB previously just failed the recovery and returned a non-ok status.
      This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files and tries to recover to the most recent state without a missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
      `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
      To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
      Test plan (on devserver):
      $make check
      $COMPILE_WITH_ASAN=1 make all && make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
      Reviewed By: anand1976
      Differential Revision: D19778960
      Pulled By: riversand963
      fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
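The incremental "materialize only complete versions" rule described in the commit above can be sketched over plain file-name sets. `VersionEdit` here is a hypothetical two-field struct and `RecoverBestEfforts` a hypothetical function; the real `VersionSet::Recover()` handles far more state than this.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified edit: files added and files deleted.
struct VersionEdit {
  std::vector<std::string> added;
  std::vector<std::string> deleted;
};

// Applies edits in order and remembers the most recent file set that
// references no file missing from `files_on_disk`. Returns false if no
// such version exists.
inline bool RecoverBestEfforts(const std::vector<VersionEdit>& edits,
                               const std::set<std::string>& files_on_disk,
                               std::set<std::string>* recovered) {
  std::set<std::string> current;
  bool found = false;
  for (const VersionEdit& edit : edits) {
    for (const std::string& f : edit.deleted) current.erase(f);
    for (const std::string& f : edit.added) current.insert(f);
    // std::includes works here because std::set iterates in sorted order.
    const bool complete =
        std::includes(files_on_disk.begin(), files_on_disk.end(),
                      current.begin(), current.end());
    if (complete) {
      *recovered = current;  // "materialize" this version
      found = true;
    }
  }
  return found;
}
```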
  25. 12 Mar, 2020 1 commit
    • Cache result of GetLogicalBufferSize in Linux (#6457) · 2d9efc9a
      Cheng Chang authored
      In Linux, when reopening a DB with many SST files, profiling shows 100% system CPU time spent in `GetLogicalBufferSize` for a couple of seconds. This slows down MyRocks' recovery when a site is down.
      This PR introduces two new APIs:
      1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` let `DB` tell the env when it starts or stops using its database directories. The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
      2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
      Other modifications:
      1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
      2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
      3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
      Test Plan:
      1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
      2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
      `make check`
      Differential Revision: D20131243
      Pulled By: cheng-chang
      fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
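A sketch of a reference-counted directory-to-block-size cache in the spirit of the commit above. `BlockSizeCache`, its `Ref`/`Unref`/`Get` methods, and the injected `query` callback are hypothetical; the real `LogicalBlockSizeCache` is Linux-specific and must be thread-safe, which this single-threaded sketch is not.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical single-threaded cache keyed by database directory.
class BlockSizeCache {
 public:
  // `query` stands in for the expensive system lookup.
  explicit BlockSizeCache(std::function<std::size_t(const std::string&)> query)
      : query_(std::move(query)) {}

  // Called when a DB starts using `dir`: query once, then count references.
  void Ref(const std::string& dir) {
    auto it = cache_.find(dir);
    if (it == cache_.end()) {
      cache_.emplace(dir, Entry{query_(dir), 1});
    } else {
      ++it->second.refs;
    }
  }

  // Called when a DB stops using `dir`: drop the entry with the last user.
  void Unref(const std::string& dir) {
    auto it = cache_.find(dir);
    if (it != cache_.end() && --it->second.refs == 0) {
      cache_.erase(it);
    }
  }

  // Cached size if `dir` is registered, otherwise a fresh query.
  std::size_t Get(const std::string& dir) const {
    auto it = cache_.find(dir);
    return it != cache_.end() ? it->second.size : query_(dir);
  }

 private:
  struct Entry {
    std::size_t size;
    int refs;
  };
  std::function<std::size_t(const std::string&)> query_;
  std::map<std::string, Entry> cache_;
};
```

Reference counting lets several DB instances share one directory entry while still evicting it once the last instance closes.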
  26. 11 Mar, 2020 1 commit
  27. 10 Mar, 2020 1 commit
    • Support options.max_open_files != -1 with FIFO compaction (#6503) · fd1da221
      Yanqin Jin authored
      Allow user to specify options.max_open_files != -1 with FIFO compaction.
      If max_open_files != -1, not all table files are kept open.
      In the past, FIFO style compaction required all table files to be open in order
      to read file creation time from table properties. Later, we added file creation
      time to MANIFEST, making it possible to read file creation time without opening
      the files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503
      Test Plan: make check
      Differential Revision: D20353758
      Pulled By: riversand963
      fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
  28. 07 Mar, 2020 2 commits
    • Iterator with timestamp (#6255) · d93812c9
      Yanqin Jin authored
      Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
      Create an iterator with timestamp.
      read_opts.timestamp = &ts;
      auto* iter = db->NewIterator(read_opts);
      // target is key without timestamp.
      for (iter->Seek(target); iter->Valid(); iter->Next()) {}
      for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
      delete iter;
      read_opts.timestamp = &ts1;
      // lower_bound and upper_bound are without timestamp.
      read_opts.iterate_lower_bound = &lower_bound;
      read_opts.iterate_upper_bound = &upper_bound;
      auto* iter1 = db->NewIterator(read_opts);
      // Do Seek or SeekToFirst()
      delete iter1;
      Test plan (dev server)
      $make check
      Simple benchmarking (dev server)
      1. The overhead introduced by this PR even when timestamp is disabled.
      key size: 16 bytes
      value size: 100 bytes
      Entries: 1000000
      Data reside in main memory, and try to stress iterator.
      Repeated three times on master and this PR.
      - Seek without next
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
      master: 159047.0 ops/sec
      this PR: 158922.3 ops/sec (about 0.08% drop in throughput)
      - Seek and next 10 times
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
      master: 109539.3 ops/sec
      this PR: 107519.7 ops/sec (2% drop in throughput)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
      Differential Revision: D19438227
      Pulled By: riversand963
      fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
    • Fix spelling: commited -> committed (#6481) · f6c2777d
      Otto Kekäläinen authored
      In most places in the code the variable names are spelled correctly as
      COMMITTED but in a couple places not. This fixes them and ensures the
      variable is always called COMMITTED everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481
      Differential Revision: D20306776
      Pulled By: pdillinger
      fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022