1. 17 Jun 2020 · 1 commit
    • Let best-efforts recovery ignore CURRENT file (#6970) · 97a69f43
      Yanqin Jin authored
      Summary:
      Best-efforts recovery does not check the content of the CURRENT file to determine which MANIFEST to recover from. However, it still checks the presence of the CURRENT file to determine whether to create a new DB during `open()`. Therefore, we can tweak the logic in `open()` a little bit so that best-efforts recovery does not rely on the CURRENT file at all.
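      For illustration, a minimal C++ sketch of opting into this mode; the DB path is made up and error handling is kept trivial:

      ```
      #include <iostream>
      #include "rocksdb/db.h"

      int main() {
        rocksdb::Options options;
        // With best-efforts recovery, open() no longer depends on the CURRENT
        // file; RocksDB recovers from the MANIFEST/SST files actually present.
        options.best_efforts_recovery = true;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_ber_example", &db);
        if (!s.ok()) {
          std::cerr << "Open failed: " << s.ToString() << std::endl;
          return 1;
        }
        delete db;
        return 0;
      }
      ```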
      
      Test plan (dev server):
      make check
      ./db_basic_test --gtest_filter=DBBasicTest.RecoverWithNoCurrentFile
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6970
      
      Reviewed By: anand1976
      
      Differential Revision: D22013990
      
      Pulled By: riversand963
      
      fbshipit-source-id: db552a1868c60ed70e1f7cd252a3a076eb8ea58f
  2. 13 Jun 2020 · 1 commit
  3. 12 Jun 2020 · 2 commits
    • Fail point-in-time WAL recovery upon IOError reading WAL (#6963) · 717749f4
      Yanqin Jin authored
      Summary:
      If `options.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery`, RocksDB stops replaying the WAL once it hits an error and discards the rest of the WAL. This can lead to data loss if the error occurs at an offset smaller than the last synced offset.
      Ideally, point-in-time recovery should permit recovery when the error occurs after the last synced offset and fail recovery when it occurs before. However, RocksDB does not track the synced offsets of its WALs, so it cannot tell which side of the last synced offset an error falls on. An error can be one of the following.
      - WAL record checksum mismatch. This can result from either corruption of synced data or dropping of unsynced data during shutdown, and we cannot tell which. In order not to defeat the original motivation of permitting the latter case, we keep the original behavior of point-in-time WAL recovery.
      - IOError. This indicates that the WAL file itself, including its synced portion, may be unavailable. Therefore, we choose to modify the behavior of point-in-time recovery and fail the database recovery.
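      For reference, a hedged sketch of configuring point-in-time WAL recovery, whose failure behavior this change adjusts; the DB path is illustrative:

      ```
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Replay the WAL up to the last consistent record. After this change,
        // an IOError while reading the WAL fails the open instead of silently
        // truncating recovery at the error point.
        options.wal_recovery_mode = rocksdb::WALRecoveryMode::kPointInTimeRecovery;

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksdb_wal_example", &db);
        if (!s.ok()) {
          std::cerr << "Open failed: " << s.ToString() << std::endl;
          return 1;
        }
        delete db;
        return 0;
      }
      ```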
      
      Test plan (devserver):
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6963
      
      Reviewed By: ajkr
      
      Differential Revision: D22011083
      
      Pulled By: riversand963
      
      fbshipit-source-id: f9cbf29a37dc5cc40d3fa62f89eed1ad67ca1536
    • Ingest SST files with checksum information (#6891) · b3585a11
      Zhichao Cao authored
      Summary:
      Applications can ingest SST files together with file checksum information, so that during ingestion the DB can check the data integrity and identity of the SST file. The PR introduces generate_and_verify_file_checksum in IngestExternalFileOptions to control whether the ingested checksum information should be verified against the generated checksum.
      
          1. If generate_and_verify_file_checksum is *FALSE*: *1)* if the DB does not enable SST file checksums, the ingested checksum information is ignored; *2)* if the DB enables SST file checksums and the checksum function name matches the checksum function name in the DB, we trust the ingested checksum and store it in the Manifest. If the checksum function name does not match, we treat that as an error and fail the IngestExternalFile() call.
          2. If generate_and_verify_file_checksum is *TRUE*: *1)* if the DB does not enable SST file checksums, the ingested checksum information is ignored; *2)* if the DB enables SST file checksums, we use the DB's checksum generator to calculate the checksum for each ingested SST file after it is copied or moved, then compare the results with the ingested checksum information: _A)_ if the checksum function name does not match, _verification always reports true_ and we store the DB-generated checksum information in the Manifest; _B)_ if the checksum function name matches and the checksums match, ingestion continues and the checksum information is stored in the Manifest; otherwise, file ingestion is terminated and file corruption is reported. A usage sketch follows this list.
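      A hedged usage sketch of the ingestion flow above. The option field name is copied from this summary and may differ in the released IngestExternalFileOptions header; the DB is assumed to have been opened with Options::file_checksum_gen_factory set:

      ```
      #include <iostream>
      #include <string>
      #include "rocksdb/db.h"

      // Sketch only; assumes `db` was opened with a file checksum generator
      // factory configured (e.g. GetFileChecksumGenCrc32cFactory()).
      void IngestWithChecksum(rocksdb::DB* db, const std::string& sst_path) {
        rocksdb::IngestExternalFileOptions ifo;
        // Field name taken from this summary; the released header may differ.
        ifo.generate_and_verify_file_checksum = true;
        rocksdb::Status s = db->IngestExternalFile({sst_path}, ifo);
        if (!s.ok()) {
          std::cerr << "Ingestion failed: " << s.ToString() << std::endl;
        }
      }
      ```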
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6891
      
      Test Plan: added unit test, pass make asan_check
      
      Reviewed By: pdillinger
      
      Differential Revision: D21935988
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 7b55f486632db467e76d72602218d0658aa7f6ed
  4. 11 Jun 2020 · 1 commit
    • save a key comparison in block seeks (#6646) · e6be168a
      Andrew Kryczka authored
      Summary:
      This saves up to two key comparisons in block seeks. The first key
      comparison saved is a redundant key comparison against the restart key
      where the linear scan starts. This comparison is saved in all cases
      except when the found key is in the first restart interval. The
      second key comparison saved is a redundant key comparison against the
      restart key where the linear scan ends. This is only saved in cases
      where all keys in the restart interval are less than the target
      (probability roughly `1/restart_interval`).
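      The idea can be illustrated with a generic sketch (not the actual BlockIter code): the binary search over restart points already compares the target against the restart keys bounding the linear scan, so the scan can reuse those results instead of repeating them:

      ```
      #include <string>
      #include <vector>

      // Generic illustration only. Returns the index of the first key >= target,
      // or keys.size() if none. `restarts` holds sorted indices of restart keys.
      size_t SeekInBlock(const std::vector<std::string>& keys,
                         const std::vector<size_t>& restarts,
                         const std::string& target) {
        if (restarts.empty()) return 0;
        size_t lo = 0, hi = restarts.size();
        bool restart_le_target = false;  // is keys[restarts[lo]] known <= target?
        while (lo + 1 < hi) {
          size_t mid = (lo + hi) / 2;
          if (keys[restarts[mid]] <= target) {
            lo = mid;
            restart_le_target = true;
          } else {
            hi = mid;
          }
        }
        // First saved comparison: skip re-checking the restart key when the
        // binary search already established that it is <= target.
        size_t begin = restarts[lo] + (restart_le_target ? 1 : 0);
        size_t end = (hi < restarts.size()) ? restarts[hi] : keys.size();
        for (size_t i = begin; i < end; ++i) {
          if (keys[i] >= target) return i;
        }
        // Second saved comparison: the binary search already established that
        // target < keys[restarts[hi]], so `end` can be returned without
        // comparing that restart key again.
        return end;
      }
      ```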
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646
      
      Test Plan:
      ran a benchmark with mostly default settings and counted key comparisons
      
      before: `user_key_comparison_count = 19399529`
      after: `user_key_comparison_count = 18431498`
      
      setup command:
      
      ```
      $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
      ```
      
      benchmark command:
      
      ```
      $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D20849707
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
  5. 10 Jun 2020 · 1 commit
  6. 09 Jun 2020 · 2 commits
    • Fix a bug in looking up duplicate keys with MultiGet (#6953) · 1fb3593f
      anand76 authored
      Summary:
      When MultiGet is called with duplicate keys, and the key matches the
      largest key in an SST file and the value type is merge, only the first
      instance of the duplicate key is returned with correct results. This is
      due to the incorrect assumption that if a key in a batch is equal to the
      largest key in the file, the next key cannot be present in that file.
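      For reference, a hedged sketch of a MultiGet batch containing a duplicate key, the situation this fix addresses. It uses the simple vector-based overload for brevity (the fix itself is in the batched MultiGet path), and the keys are illustrative:

      ```
      #include <iostream>
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      // Each instance of a duplicated key should get its own correct result,
      // including when the key has merge operands in an SST file.
      void LookupWithDuplicates(rocksdb::DB* db) {
        std::vector<rocksdb::Slice> keys = {"k1", "k2", "k1"};  // "k1" appears twice
        std::vector<std::string> values;
        std::vector<rocksdb::Status> statuses =
            db->MultiGet(rocksdb::ReadOptions(), keys, &values);
        for (size_t i = 0; i < keys.size(); ++i) {
          std::cout << keys[i].ToString() << " -> "
                    << (statuses[i].ok() ? values[i] : statuses[i].ToString())
                    << std::endl;
        }
      }
      ```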
      
      Tests:
      Add a new unit test
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6953
      
      Reviewed By: cheng-chang
      
      Differential Revision: D21935898
      
      Pulled By: anand1976
      
      fbshipit-source-id: a2cc327a15150e23fd997546ca64d1c33021cb4c
    • Implement a new subcommand "identify" for sst_dump (#6943) · 119b26fa
      Zitan Chen authored
      Summary:
      Implemented a subcommand of sst_dump called identify, which determines whether a file is an SST file, or identifies and lists all the SST files in a directory.
      
      This update also fixes the problem that sst_dump exits with a success state even if the target file/directory does not exist, is not an SST file, is empty, or is corrupted.
      
      One test is added to sst_dump_test.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6943
      
      Test Plan: Passed make check and a few manual tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D21928985
      
      Pulled By: gg814
      
      fbshipit-source-id: 9a8b48e0cf1a0e96b13f42b690aba8ad981afad3
  7. 06 Jun 2020 · 1 commit
    • Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909) · 98b0cbea
      anand76 authored
      Summary:
      The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()```, which could be ```false``` either because the end of the index was reached or, in the case of a partitioned index, because of a checksum mismatch error when reading a 2nd-level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.
      
      Tests:
      Add a test in block_based_table_reader_test.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909
      
      Reviewed By: pdillinger
      
      Differential Revision: D21833922
      
      Pulled By: anand1976
      
      fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
  8. 05 Jun 2020 · 1 commit
  9. 04 Jun 2020 · 2 commits
    • API change: DB::OpenForReadOnly will not write to the file system unless create_if_missing is true (#6900) · 02df00d9
      Zitan Chen authored
      
      Summary:
      DB::OpenForReadOnly will not write anything to the file system (i.e., create directories or files for the DB) unless create_if_missing is true.
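      A hedged sketch of the new behavior: with create_if_missing left at its default of false, a read-only open of a missing DB fails without creating anything on disk (the path is illustrative):

      ```
      #include <iostream>
      #include "rocksdb/db.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = false;  // default; nothing is written to disk

        rocksdb::DB* db = nullptr;
        rocksdb::Status s =
            rocksdb::DB::OpenForReadOnly(options, "/tmp/nonexistent_db", &db);
        if (!s.ok()) {
          // Expected when the DB does not exist: no directories/files created.
          std::cout << "OpenForReadOnly failed: " << s.ToString() << std::endl;
        }
        delete db;
        return 0;
      }
      ```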
      
      This change also fixes some subcommands of ldb, which wrote to the file system even when the intent was read-only.
      
      Two tests for this updated behavior of DB::OpenForReadOnly are also added.
      
      Other minor changes:
      1. Updated HISTORY.md to include this API change of DB::OpenForReadOnly;
      2. Updated the help information for the put and batchput subcommands of ldb with the option [--create_if_missing];
      3. Updated the comment of Env::DeleteDir to emphasize that it returns OK only if the directory to be deleted is empty.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6900
      
      Test Plan: passed make check; also manually tested a few ldb subcommands
      
      Reviewed By: pdillinger
      
      Differential Revision: D21822188
      
      Pulled By: gg814
      
      fbshipit-source-id: 604cc0f0d0326a937ee25a32cdc2b512f9a3be6e
    • Mention the consistency check improvement in HISTORY.md (#6924) · 0b8c549b
      Levi Tamasi authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6924
      
      Reviewed By: cheng-chang
      
      Differential Revision: D21865662
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 83a01bcbb779cfba941154a36a9e735293a93211
  10. 03 Jun 2020 · 1 commit
    • For ApproximateSizes, pro-rate table metadata size over data blocks (#6784) · 14eca6bf
      Peter Dillinger authored
      Summary:
      The implementation of GetApproximateSizes was inconsistent in
      its treatment of the size of non-data blocks of SST files, sometimes
      including them and sometimes not. This was at its worst when a large
      portion of a table file was used by filters and a small query range
      crossed a table boundary: the size estimate would then include the
      large filter size.
      
      It's conceivable that someone might want only to know the size in terms
      of data blocks, but I believe that's unlikely enough to ignore for now.
      Similarly, there's no evidence the internal function ApproximateOffsetOf
      is used for anything other than a one-sided ApproximateSize, so I intend
      to refactor to remove redundancy in a follow-up commit.
      
      So to fix this, GetApproximateSizes (and implementation details
      ApproximateSize and ApproximateOffsetOf) now consistently include in
      their returned sizes a portion of table file metadata (incl filters
      and indexes) based on the size portion of the data blocks in range. In
      other words, if a key range covers data blocks that are X% by size of all
      the table's data blocks, returned approximate size is X% of the total
      file size. It would technically be more accurate to attribute metadata
      based on number of keys, but that's not computationally efficient with
      data available and rarely a meaningful difference.
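      For illustration, a hedged sketch of querying the approximate size of a key range, which now includes a pro-rated share of table metadata; the key range is made up and default SizeApproximationOptions are used:

      ```
      #include <cstdint>
      #include <iostream>
      #include "rocksdb/db.h"

      // The returned size now includes a share of filter/index bytes
      // proportional to the data-block bytes covered by the range.
      void PrintApproximateSize(rocksdb::DB* db) {
        rocksdb::Range range("a", "m");  // [start, limit), keys illustrative
        uint64_t size = 0;
        rocksdb::SizeApproximationOptions size_opts;  // defaults include files
        rocksdb::Status s = db->GetApproximateSizes(
            size_opts, db->DefaultColumnFamily(), &range, 1, &size);
        if (s.ok()) {
          std::cout << "approximate size: " << size << " bytes" << std::endl;
        }
      }
      ```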
      
      Also includes miscellaneous comment improvements / clarifications.
      
      Also included is a new approximatesizerandom benchmark for db_bench.
      No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
      
      Test Plan:
      Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
      Old code running new test...
      
          [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
          db/db_test.cc:1562: Failure
          Expected: (size) <= (11 * 100), actual: 9478 vs 1100
      
      Other tests updated to reflect consistent accounting of metadata.
      
      Reviewed By: siying
      
      Differential Revision: D21334706
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
  11. 29 May 2020 · 1 commit
    • avoid `IterKey::UpdateInternalKey()` in `BlockIter` (#6843) · c5abf78b
      Andrew Kryczka authored
      Summary:
      `IterKey::UpdateInternalKey()` is an error-prone API as it's
      incompatible with `IterKey::TrimAppend()`, which is used for
      decoding delta-encoded internal keys. This PR stops using it in
      `BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
      buffer when needed. The logic for safely getting a Slice with global
      seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
      `BinarySeek()` is also migrated to use this API (previously it ignored
      global seqno entirely).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843
      
      Test Plan:
      benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
      created by regular flush; "ingestion_db" created by ingesting a file. Both
      DBs have same contents.
      
      ```
      $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
      $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
      $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
      ```
      
      benchmark run command:
      ```
      TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
      ```
      
      results:
      
      | DB | code | throughput |
      |---|---|---|
      | normal_db | master |  267.9 |
      | normal_db   |    PR6843 | 254.2 (-5.1%) |
      | ingestion_db |   master |  259.6 |
      | ingestion_db |   PR6843 | 250.5 (-3.5%) |
      
      Reviewed By: pdillinger
      
      Differential Revision: D21562604
      
      Pulled By: ajkr
      
      fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
  12. 28 May 2020 · 1 commit
    • Allow MultiGet users to limit cumulative value size (#6826) · bcefc59e
      Akanksha Mahajan authored
      Summary:
      1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Aborted without fetching any further keys (see the sketch below).
      2. Add a unit test case MultiGetBatchedValueSizeSimple that reads keys from memory and SST files.
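      A hedged sketch of the limit described in item 1. The ReadOptions field name follows this summary and may differ in the released header; keys and the byte limit are illustrative:

      ```
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      // Sketch of the cumulative value-size limit described above.
      void BoundedMultiGet(rocksdb::DB* db) {
        rocksdb::ReadOptions read_options;
        // Field name taken from this summary; the released header may differ.
        read_options.value_size = 1024 * 1024;  // stop after ~1MB of values

        std::vector<rocksdb::Slice> keys = {"k1", "k2", "k3"};
        std::vector<std::string> values;
        std::vector<rocksdb::Status> statuses =
            db->MultiGet(read_options, keys, &values);
        for (const auto& s : statuses) {
          // Keys past the limit come back Aborted and must be re-requested.
          if (s.IsAborted()) { /* re-issue the remaining keys in a new batch */ }
        }
      }
      ```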
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826
      
      Test Plan:
      1. make check -j64
      	   2. Add a new unit test case
      
      Reviewed By: anand1976
      
      Differential Revision: D21471483
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
  13. 21 May 2020 · 1 commit
    • Generate file checksum in SstFileWriter (#6859) · 545e14b5
      Zhichao Cao authored
      Summary:
      If Options.file_checksum_gen_factory is set, RocksDB generates the file checksum during flush and compaction based on the checksum generator created by the factory, and stores the checksum and checksum function name in vstorage and the Manifest.
      
      This PR enables file checksum generation in SstFileWriter and stores the checksum and checksum function name in the ExternalSstFileInfo, so that applications can use them for other purposes, for example ingesting the file checksum together with the file via IngestExternalFile().
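      A hedged sketch of generating a file checksum with SstFileWriter and reading it back from ExternalSstFileInfo, as described above; the checksum field names follow this summary, and the path/keys are illustrative:

      ```
      #include <iostream>
      #include <string>
      #include "rocksdb/env.h"
      #include "rocksdb/file_checksum.h"
      #include "rocksdb/options.h"
      #include "rocksdb/sst_file_writer.h"

      void WriteSstWithChecksum(const std::string& path) {
        rocksdb::Options options;
        options.file_checksum_gen_factory =
            rocksdb::GetFileChecksumGenCrc32cFactory();

        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
        if (!writer.Open(path).ok()) return;
        writer.Put("key1", "value1");  // keys must be added in sorted order
        writer.Put("key2", "value2");

        rocksdb::ExternalSstFileInfo info;
        rocksdb::Status s = writer.Finish(&info);
        if (s.ok()) {
          // Field names per this summary; usable later for ingestion.
          std::cout << "checksum func: " << info.file_checksum_func_name
                    << ", checksum: " << info.file_checksum << std::endl;
        }
      }
      ```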
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859
      
      Test Plan: add unit test and pass make asan_check.
      
      Reviewed By: ajkr
      
      Differential Revision: D21656247
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
  14. 13 May 2020 · 1 commit
    • sst_dump to reduce number of file reads (#6836) · 4a4b8a13
      sdong authored
      Summary:
      sst_dump can issue many file reads from the file system. This doesn't work well with file systems without an OS cache, especially remote file systems. In order to mitigate this problem, several improvements are made:
      1. --readahead_size is added, so that users can specify the readahead size used when scanning the data.
      2. Force a 512KB tail readahead, which avoids three separate I/Os for the footer, metaindex and properties blocks, and hopefully covers the index and filter blocks too.
      3. Consolidate sst_dump's I/Os before opening the file for reading, using the same file prefetch buffer.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
      
      Test Plan: Add a test that covers this new feature.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21516607
      
      fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
  15. 09 May 2020 · 2 commits
    • Improve ldb consistency checks (#6802) · a50ea71c
      sdong authored
      Summary:
      When using ldb, users cannot turn on force consistency checks in most commands, and they cannot use checkconsistency with --try_load_options. The change fixes both by:
      1. checkconsistency now calls OpenDB() so that it gets all the option-loading and sanitization logic
      2. using options.force_consistency_checks = true by default, and adding a --disable_consistency_checks flag to turn it off.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6802
      
      Test Plan: Add a new unit test. Some manual tests with corrupted DBs.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21388051
      
      fbshipit-source-id: 8d122732d391b426e3982a1c3232a8e3763ffad0
    • Fix a few bugs in best-efforts recovery (#6824) · e72e2167
      Yanqin Jin authored
      Summary:
      1. Update column_family_memtables_ to point to latest column_family_set in
         version_set after recovery.
      2. Normalize file paths passed by application so that directories end with '/'
         or '\\'.
      3. In addition to missing files, corrupted files are also ignored in
         best-efforts recovery.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6824
      
      Test Plan: COMPILE_WITH_ASAN=1 make check
      
      Reviewed By: anand1976
      
      Differential Revision: D21463905
      
      Pulled By: riversand963
      
      fbshipit-source-id: c48db8843cc93c8c1c7139c474b64e6f775307d2
  16. 08 May 2020 · 4 commits
    • Fix race due to delete triggered compaction in Universal compaction mode (#6799) · 94265234
      anand76 authored
      Summary:
      Delete triggered compaction in universal compaction mode was causing a corruption when scheduled in parallel with other compactions.
      1. When num_levels = 1, a file marked for compaction may be picked along with all older files in L0, without checking whether any of them are already being compacted. This can cause unpredictable results like resurrection of older versions of keys or deleted keys.
      2. When num_levels > 1, a delete triggered compaction would not get scheduled if it overlaps with a running regular compaction. However, the reverse is not true. This is due to the fact that in ```UniversalCompactionBuilder::CalculateSortedRuns```, it assumes that entire sorted runs are picked for compaction and only checks the first file in a sorted run to determine conflicts. This is violated by a delete triggered compaction as it works on a subset of a sorted run.
      
      Fix the bug for num_levels > 1, and disable the feature for now when num_levels = 1. After disabling this feature, files would still get marked for compaction, but no compaction would get scheduled.
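      For context, "delete triggered compaction" here refers to files marked for compaction by a table properties collector such as CompactOnDeletionCollector; a hedged sketch of enabling it under universal compaction (window and trigger values are illustrative):

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/utilities/table_properties_collectors.h"

      rocksdb::Options MakeOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleUniversal;
        options.num_levels = 4;  // the fix applies to num_levels > 1;
                                 // the feature is disabled for num_levels = 1
        // Mark a file for compaction once it sees >= 50 deletes in any
        // window of 128 consecutive entries (values illustrative).
        options.table_properties_collector_factories.emplace_back(
            rocksdb::NewCompactOnDeletionCollectorFactory(
                /*sliding_window_size=*/128, /*deletion_trigger=*/50));
        return options;
      }
      ```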
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6799
      
      Reviewed By: pdillinger
      
      Differential Revision: D21431286
      
      Pulled By: anand1976
      
      fbshipit-source-id: ae9f0bdb1d6ae2f10284847db731c23f43af164a
    • Fixup HISTORY.md for e9ba4ba3 "validate range tombstone covers positive range" (#6825) · 3730b05d
      Andrew Kryczka authored
      Summary:
      
      Moved it from the wrong section (6.10) to the right section (Unreleased).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6825
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21464577
      
      Pulled By: ajkr
      
      fbshipit-source-id: a836b4ab10be2464182826f9411c9c424c933b70
    • Fix false NotFound from batched MultiGet with kHashSearch (#6821) · b27a1448
      Peter Dillinger authored
      Summary:
      The error is that KeyContext::s was assigned a NotFound status in a
      table reader for a "not found in this table" case, which skips searching
      in later tables, as only a deletion should. (The hash search index iterator
      is the only one that can return status NotFound even if Valid() == false.)
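      For reference, a hedged sketch of the kHashSearch configuration involved in this fix; the prefix length is illustrative, and a prefix extractor is required for the hash index:

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/slice_transform.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakeHashIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type = rocksdb::BlockBasedTableOptions::kHashSearch;

        rocksdb::Options options;
        options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(3));
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```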
      
      This was detected by intermittent failure in
      MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821
      
      Test Plan: modified existing unit test to reproduce problem
      
      Reviewed By: anand1976
      
      Differential Revision: D21450469
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
    • validate range tombstone covers positive range (#6788) · e9ba4ba3
      Andrew Kryczka authored
      Summary:
      We found some files containing nothing but negative range tombstones,
      and unsurprisingly their metadata specified a negative range, which made
      things crash. Time to add a bit of user input validation.
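      A hedged sketch of the rule being validated: a range deletion must cover a positive range, i.e. the begin key must sort before the end key (keys illustrative; the exact status returned for a rejected range is not spelled out here):

      ```
      #include "rocksdb/db.h"

      rocksdb::Status DeleteSomeRanges(rocksdb::DB* db) {
        // OK: "a" sorts before "z", a positive range.
        rocksdb::Status s = db->DeleteRange(
            rocksdb::WriteOptions(), db->DefaultColumnFamily(), "a", "z");
        if (!s.ok()) return s;
        // A "negative" range like ("z", "a") is the kind of input this change
        // rejects instead of letting it produce files with inverted metadata.
        return db->DeleteRange(rocksdb::WriteOptions(),
                               db->DefaultColumnFamily(), "z", "a");
      }
      ```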
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21343719
      
      Pulled By: ajkr
      
      fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74
  17. 07 May 2020 · 2 commits
  18. 05 May 2020 · 2 commits
    • Fix db_stress when GetLiveFiles() flushes dropped CF (#6805) · 5a61e786
      Yanqin Jin authored
      Summary:
      Current impl. of db_stress will abort verification and report failure if
      GetLiveFiles() causes a dropped column family to be flushed. This is not
      desired.
      To fix, this PR makes the following change:
      In GetLiveFiles, if the triggered flush returns a status for which
      IsColumnFamilyDropped() is true, then set the status to Status::OK().
      This is OK because dropped column families will be skipped during the rest of
      this function, and valid column families will have their live files returned to
      caller.
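      For reference, a hedged sketch of the GetLiveFiles call exercised here; flush_memtable=true is what can trigger a flush of a just-dropped column family:

      ```
      #include <cstdint>
      #include <iostream>
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      // After this fix, a flush hitting a dropped column family no longer
      // fails the whole call; the dropped CF is simply skipped.
      void ListLiveFiles(rocksdb::DB* db) {
        std::vector<std::string> files;
        uint64_t manifest_file_size = 0;
        rocksdb::Status s =
            db->GetLiveFiles(files, &manifest_file_size, /*flush_memtable=*/true);
        if (s.ok()) {
          for (const auto& f : files) {
            std::cout << f << std::endl;
          }
        }
      }
      ```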
      
      Test plan (dev server):
      make check
      ./db_stress -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
      ./db_stress -disable_wal=1 -reopen=0 -ops_per_thread=1000 -get_live_files_one_in=100 -clear_column_family_one_in=100
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6805
      
      Reviewed By: ltamasi
      
      Differential Revision: D21390044
      
      Pulled By: riversand963
      
      fbshipit-source-id: de67846b95a4f1b88aa0a30c3d70c43cc68625b9
    • Avoid Swallowing Some File Consistency Checking Bugs (#6793) · 680c4163
      sdong authored
      Summary:
      We are swallowing some file consistency checking failures. This is not expected. We are fixing two cases: DB reopen and manifest dump.
      More places are not fixed and need follow-up.
      
      Error from CheckConsistencyForDeletes() is also swallowed, which is not fixed in this PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6793
      
      Test Plan: Add a unit test to cover the reopen case.
      
      Reviewed By: riversand963
      
      Differential Revision: D21366525
      
      fbshipit-source-id: eb438a322237814e8d5125f916a3c6de97f39ded
  19. 01 May 2020 · 1 commit
  20. 29 Apr 2020 · 2 commits
    • Basic MultiGet support for partitioned filters (#6757) · bae6f586
      Peter Dillinger authored
      Summary:
      In MultiGet, access each applicable filter partition only once
      per batch, rather than for each applicable key. Also,
      
      * Fix Bloom stats for MultiGet
      * Fix/refactor MultiGetContext::Range::KeysLeft, including
      * Add efficient BitsSetToOne implementation
      * Assert that MultiGetContext::Range does not go beyond shift range
      
      Performance test: Generate db:
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
          ...
      
      Before (middle performing run of three; note some missing Bloom stats):
      
          $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
          rocksdb.block.cache.filter.hit COUNT : 83443275
          rocksdb.bloom.filter.useful COUNT : 0
          rocksdb.bloom.filter.full.positive COUNT : 0
          rocksdb.bloom.filter.full.true.positive COUNT : 7931450
          rocksdb.number.multiget.get COUNT : 385984
          rocksdb.number.multiget.keys.read COUNT : 12351488
          rocksdb.number.multiget.bytes.read COUNT : 793145000
          rocksdb.number.multiget.keys.found COUNT : 7931450
      
      After (middle performing run of three):
      
          $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
          rocksdb.block.cache.filter.hit COUNT : 49856682
          rocksdb.bloom.filter.useful COUNT : 45684579
          rocksdb.bloom.filter.full.positive COUNT : 10395458
          rocksdb.bloom.filter.full.true.positive COUNT : 9908456
          rocksdb.number.multiget.get COUNT : 481984
          rocksdb.number.multiget.keys.read COUNT : 15423488
          rocksdb.number.multiget.bytes.read COUNT : 990845600
          rocksdb.number.multiget.keys.found COUNT : 9908456
      
      So that's about 25% higher throughput even for random keys
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757
      
      Test Plan: unit test included
      
      Reviewed By: anand1976
      
      Differential Revision: D21243256
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
    • HISTORY.md update for bzip upgrade (#6767) · a7f0b27b
      Peter Dillinger authored
      Summary:
      See https://github.com/facebook/rocksdb/issues/6714 and https://github.com/facebook/rocksdb/issues/6703
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6767
      
      Reviewed By: riversand963
      
      Differential Revision: D21283307
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 8463bec725669d13846c728ad4b5bde43f9a84f8
  21. 28 Apr 2020 · 4 commits
  22. 25 Apr 2020 · 1 commit
    • Reduce memory copies when fetching and uncompressing blocks from SST files (#6689) · 40497a87
      Cheng Chang authored
      Summary:
      In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
      This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.
      
      Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` first copies the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read`, depending on the block size. Then, during uncompression, it copies the uncompressed block into `BlockFetcher::heap_buf_`.
      
      In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
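      For context, the saved copy applies to reads issued in direct IO mode; a hedged sketch of enabling it (the compression choice is illustrative, since the saving concerns compressed blocks):

      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeDirectIoOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.use_direct_reads = true;  // SST reads bypass the OS page cache
        options.compression = rocksdb::kSnappyCompression;  // blocks are uncompressed via BlockFetcher
        return options;
      }
      ```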
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689
      
      Test Plan: A new unit test `block_fetcher_test` is added.
      
      Reviewed By: anand1976
      
      Differential Revision: D21006729
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
  23. 24 Apr 2020 · 1 commit
  24. 22 Apr 2020 · 1 commit
  25. 21 Apr 2020 · 1 commit
    • Set max_background_flushes dynamically (#6701) · 03a1d95d
      Akanksha Mahajan authored
      Summary:
      1. Add changes so that max_background_flushes can be set dynamically (see the sketch below).
      2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which sets max_background_flushes dynamically using SetDBOptions.
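      A hedged sketch of changing this option on a live DB, per item 1 above; the value is illustrative and the DB is assumed to be open:

      ```
      #include <iostream>
      #include "rocksdb/db.h"

      void RaiseFlushThreads(rocksdb::DB* db) {
        // Dynamically raise the number of background flush threads.
        rocksdb::Status s = db->SetDBOptions({{"max_background_flushes", "4"}});
        if (!s.ok()) {
          std::cerr << "SetDBOptions failed: " << s.ToString() << std::endl;
        }
      }
      ```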
      
      TestPlan:
      1. make -j64 check
      2. Run the new testcase DBOptionsTest.SetBackgroundFlushThreads
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
      
      Reviewed By: ajkr
      
      Differential Revision: D21028010
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
  26. 18 Apr 2020 · 1 commit
    • Add IsDirectory() to Env and FS (#6711) · 243852ec
      Yanqin Jin authored
      Summary:
      IsDirectory() is a common API to check whether a path is a regular file or
      directory.
      POSIX: call stat() and use S_ISDIR(st_mode)
      Windows: PathIsDirectoryA() and PathIsDirectoryW()
      HDFS: FileSystem.IsDirectory()
      Java: File.IsDirectory()
      ...
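      A hedged sketch of the new API as described above; the exact signature may differ slightly from this illustration, and the path is supplied by the caller:

      ```
      #include <iostream>
      #include <string>
      #include "rocksdb/env.h"

      void CheckPath(const std::string& path) {
        rocksdb::Env* env = rocksdb::Env::Default();
        bool is_dir = false;
        rocksdb::Status s = env->IsDirectory(path, &is_dir);
        if (s.ok()) {
          std::cout << path << (is_dir ? " is a directory" : " is not a directory")
                    << std::endl;
        }
      }
      ```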
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711
      
      Test Plan: make check
      
      Reviewed By: anand1976
      
      Differential Revision: D21053520
      
      Pulled By: riversand963
      
      fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
  27. 16 Apr 2020 · 1 commit
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev authored
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee