1. 01 May, 2020 1 commit
    • anand76's avatar
      Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      anand76 authored
      Summary:
      Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.
      
      For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.
      
      Tests:
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      
      Reviewed By: riversand963
      
      Differential Revision: D21285631
      
      Pulled By: anand1976
      
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
      ab13d43e
  2. 22 Apr, 2020 1 commit
    • anand76's avatar
      Implement deadline support for MultiGet (#6710) · c1ccd6b6
      anand76 authored
      Summary:
      Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This
      implementation enforces the deadline in DBImpl, which is fairly high
      level. Its best effort and may not check the deadline after every key
      lookup, but may do so after a batch of keys.
      
      In subsequent stages, we will extend this to passing a timeout down to the FileSystem.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710
      
      Test Plan: Add new unit tests
      
      Reviewed By: riversand963
      
      Differential Revision: D21149158
      
      Pulled By: anand1976
      
      fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e
      c1ccd6b6
  3. 11 Apr, 2020 1 commit
    • Huisheng Liu's avatar
      make iterator return versions between timestamp bounds (#6544) · 9e89ffb7
      Huisheng Liu authored
      Summary:
      (Based on Yanqin's idea) Add a new field in readoptions as lower timestamp bound for iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serves as a bounded time window. The iterator returns all versions of a record falling in the window.
      
      SeekRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
      base line (commit e860f884):
      seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
      This PR:
      seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)
      
      db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544
      
      Reviewed By: ltamasi
      
      Differential Revision: D20844069
      
      Pulled By: riversand963
      
      fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
      9e89ffb7
  4. 30 Mar, 2020 1 commit
    • Zhichao Cao's avatar
      Use FileChecksumGenFactory for SST file checksum (#6600) · e8d332d9
      Zhichao Cao authored
      Summary:
      In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
      
      Test Plan: tested with make asan_check
      
      Reviewed By: riversand963
      
      Differential Revision: D20717670
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
      e8d332d9
  5. 29 Mar, 2020 1 commit
    • Cheng Chang's avatar
      Be able to decrease background thread's CPU priority when creating database backup (#6602) · ee50b8d4
      Cheng Chang authored
      Summary:
      When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
      
      This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
      
      Test Plan: make check
      
      Reviewed By: siying, zhichao-cao
      
      Differential Revision: D20683216
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
      ee50b8d4
  6. 24 Mar, 2020 1 commit
    • anand76's avatar
      Simplify migration to FileSystem API (#6552) · a9d168cf
      anand76 authored
      Summary:
      The current Env/FileSystem API separation has a couple of issues -
      1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
      2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
      
      This PR solves the above issues and simplifies the migration in the following ways -
      1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
      1a. - This also makes it more robust by ensuring that even if RocksDB
        has some stray calls to Env APIs rather than FileSystem, they will go
        through the same object and thus there is no risk of getting out of
        sync.
      2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
      PosixEnv with a custom FileSystem implementation. This eliminates an
      indirection to call Env APIs, and relieves the FileSystem developer of
      the burden of having to implement wrappers for the Env APIs.
      3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
      ```NewLogger()```
      
      Tests:
      1. New unit tests
      2. make check and make asan_check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
      
      Reviewed By: riversand963
      
      Differential Revision: D20592038
      
      Pulled By: anand1976
      
      fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
      a9d168cf
  7. 21 Mar, 2020 1 commit
    • Yanqin Jin's avatar
      Attempt to recover from db with missing table files (#6334) · fb09ef05
      Yanqin Jin authored
      Summary:
      There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
      This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
      `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
      To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
      
      Test plan (on devserver):
      ```
      $make check
      $COMPILE_WITH_ASAN=1 make all && make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
      
      Reviewed By: anand1976
      
      Differential Revision: D19778960
      
      Pulled By: riversand963
      
      fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
      fb09ef05
  8. 21 Feb, 2020 1 commit
    • sdong's avatar
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong authored
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
      fdf882de
  9. 11 Feb, 2020 1 commit
    • Zhichao Cao's avatar
      Checksum for each SST file and stores in MANIFEST (#6216) · 4369f2c7
      Zhichao Cao authored
      Summary:
      In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
      
      Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string).  A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
      
      Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
      
      Differential Revision: D19171461
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
      4369f2c7
  10. 04 Feb, 2020 1 commit
    • Mike Kolupaev's avatar
      Add an option to prevent DB::Open() from querying sizes of all sst files (#6353) · 637e64b9
      Mike Kolupaev authored
      Summary:
      When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
      
      If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
      
      We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
      
      Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
      
      Differential Revision: D19656425
      
      Pulled By: al13n321
      
      fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
      637e64b9
  11. 30 Jan, 2020 1 commit
  12. 29 Jan, 2020 1 commit
    • sdong's avatar
      Add ReadOptions.auto_prefix_mode (#6314) · 8f2bee67
      sdong authored
      Summary:
      Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
      
      Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
      
      Differential Revision: D19458717
      
      fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
      8f2bee67
  13. 18 Dec, 2019 1 commit
    • 解轶伦's avatar
      delete superversions in BackgroundCallPurge (#6146) · 39fcaf82
      解轶伦 authored
      Summary:
      I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
      
      Then I found "delete sv" in ~SuperVersion() takes the time.
      
      The backtrace looks like this
      
      DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
      DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
      
      I think it's better to delete in a background thread,  please review it。
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
      
      Differential Revision: D18972066
      
      fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
      39fcaf82
  14. 14 Dec, 2019 1 commit
    • anand76's avatar
      Introduce a new storage specific Env API (#5761) · afa2420c
      anand76 authored
      Summary:
      The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
      afa2420c
  15. 28 Nov, 2019 1 commit
  16. 19 Sep, 2019 1 commit
  17. 13 Sep, 2019 1 commit
    • Lingjing You's avatar
      Add insert hints for each writebatch (#5728) · 1a928c22
      Lingjing You authored
      Summary:
      Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it.
      
      Bench result (qps):
      
      `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`
      
      master:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 387883  | 220790  | 308294  | 490998  |
      | 10                      | 1397208 | 978911  | 1275684 | 1733395 |
      | 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
      | 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |
      
      fillseq with writebatch hint:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 286005  | 223570  | 300024  | 466981  |
      | 10                      | 970374  | 813308  | 1399299 | 1753588 |
      | 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
      | 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728
      
      Differential Revision: D17297240
      
      fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
      1a928c22
  18. 12 Sep, 2019 1 commit
  19. 03 Sep, 2019 1 commit
    • Vijay Nadimpalli's avatar
      Persistent globally unique DB ID in manifest (#5725) · 979fbdc6
      Vijay Nadimpalli authored
      Summary:
      Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
      The goal of this PR is solve this problem by moving the  DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
      
      Test Plan: Added unit tests.
      
      Differential Revision: D16963840
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
      979fbdc6
  20. 21 Aug, 2019 1 commit
  21. 31 Jul, 2019 1 commit
    • Eli Pozniansky's avatar
      Improve CPU Efficiency of ApproximateSize (part 2) (#5609) · 4834dab5
      Eli Pozniansky authored
      Summary:
      In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
      
      Differential Revision: D16433481
      
      Pulled By: elipoz
      
      fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
      4834dab5
  22. 26 Jul, 2019 1 commit
  23. 25 Jul, 2019 1 commit
    • Maysam Yabandeh's avatar
      Declare snapshot refresh incompatible with delete range (#5625) · d9dc6b46
      Maysam Yabandeh authored
      Summary:
      The ::snap_refresh_nanos option is incompatible with DeleteRange feature. Currently the code relies on range_del_agg.IsEmpty() to disable it if there are range delete tombstones. However ::IsEmpty does not guarantee that there is no RangeDelete tombstones in the SST files. The patch declares the two features incompatible in inline comments until we later figure how to properly detect the presence of RangeDelete tombstones in compaction inputs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5625
      
      Differential Revision: D16468218
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: bd7beca278bc7e1db75e7ee4522d05a3a6ca86f4
      d9dc6b46
  24. 23 Jul, 2019 1 commit
  25. 20 Jul, 2019 1 commit
  26. 18 Jul, 2019 1 commit
    • Venki Pallipadi's avatar
      Export Import sst files (#5495) · 22ce4624
      Venki Pallipadi authored
      Summary:
      Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
      
      This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
      "Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
      
      We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
      
      (1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
      // Exports all live SST files of a specified Column Family onto export_dir,
      // returning SST files information in metadata.
      // - SST files will be created as hard links when the directory specified
      //   is in the same partition as the db directory, copied otherwise.
      // - export_dir should not already exist and will be created by this API.
      // - Always triggers a flush.
      virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                        const std::string& export_dir,
                                        ExportImportFilesMetaData** metadata);
      
      Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
      metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
      returning the list of file metadata.
      
      (2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
      // CreateColumnFamilyWithImport() will create a new column family with
      // column_family_name and import external SST files specified in metadata into
      // this column family.
      // (1) External SST files can be created using SstFileWriter.
      // (2) External SST files can be exported from a particular column family in
      //     an existing DB.
      // Option in import_options specifies whether the external files are copied or
      // moved (default is copy). When option specifies copy, managing files at
      // external_file_path is caller's responsibility. When option specifies a
      // move, the call ensures that the specified files at external_file_path are
      // deleted on successful return and files are not modified on any error
      // return.
      // On error return, column family handle returned will be nullptr.
      // ColumnFamily will be present on successful return and will not be present
      // on error return. ColumnFamily may be present on any crash during this call.
      virtual Status CreateColumnFamilyWithImport(
          const ColumnFamilyOptions& options, const std::string& column_family_name,
          const ImportColumnFamilyOptions& import_options,
          const ExportImportFilesMetaData& metadata,
          ColumnFamilyHandle** handle);
      
      Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
      
      If incoming sequence number is higher than current local sequence number, local sequence
      number is updated to reflect this.
      
      Note, as the sst files is are being moved across Column Families, Column Family name in sst file
      will no longer match the actual column family on destination DB. The API does not modify Column
      Family name or id in the sst files being imported.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
      
      Differential Revision: D16018881
      
      fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
      22ce4624
  27. 20 Jun, 2019 1 commit
  28. 18 Jun, 2019 1 commit
  29. 06 Jun, 2019 1 commit
    • Yanqin Jin's avatar
      Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Yanqin Jin authored
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeat for 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
      340ed4fa
  30. 24 May, 2019 1 commit
    • haoyuhuang's avatar
      Provide an option so that SST ingestion won't fall back to copy after hard linking fails (#5333) · 74a334a2
      haoyuhuang authored
      Summary:
      RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named  'failed_move_fall_back_to_copy' for users to choose which behavior they want.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
      
      Differential Revision: D15457597
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
      74a334a2
  31. 20 May, 2019 1 commit
    • Maysam Yabandeh's avatar
      WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313) · 5c0e3041
      Maysam Yabandeh authored
      Summary:
      WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
      The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
      
      Differential Revision: D15379977
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
      5c0e3041
  32. 14 May, 2019 1 commit
    • Maysam Yabandeh's avatar
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh authored
      Summary:
      Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
      The patch is prepared based on an original by siying
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
      f383641a
  33. 04 May, 2019 1 commit
    • Maysam Yabandeh's avatar
      Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh authored
      Summary:
      Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
      For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
      This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
      6a40ee5e
  34. 02 May, 2019 1 commit
  35. 27 Apr, 2019 1 commit
    • Sagar Vemuri's avatar
      Improve explicit user readahead performance (#5246) · 3548e422
      Sagar Vemuri authored
      Summary:
      Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators on setting ReadOptions.readahead_size
      - At least "readahead" number of bytes are actually getting read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
      3548e422
  36. 26 Apr, 2019 1 commit
    • Maysam Yabandeh's avatar
      Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh authored
      Summary:
      Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
      506e8448
  37. 23 Apr, 2019 1 commit
    • Andrew Kryczka's avatar
      Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Andrew Kryczka authored
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9
      8272a6de
  38. 17 Apr, 2019 1 commit
  39. 12 Apr, 2019 1 commit
    • Siying Dong's avatar
      Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165) · ed9f5e21
      Siying Dong authored
      Summary:
      Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
      Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
      
      Differential Revision: D14880709
      
      Pulled By: siying
      
      fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
      ed9f5e21
  40. 02 Apr, 2019 1 commit
    • Mike Kolupaev's avatar
      Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043) · 120bc471
      Mike Kolupaev authored
      Summary:
      Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
      
      In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
      
      (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
      I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
      
      Differential Revision: D14339233
      
      Pulled By: al13n321
      
      fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
      120bc471