1. 05 Sep, 2019 2 commits
    • Adding DB::GetCurrentWalFile() API as a replication/backup helper (#5765) · 229e6fbe
      Affan Dar authored
      Summary:
      Adding a lightweight API to get the name and size of the last live WAL file. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
      
      Specifically, within MySQL's backup/restore mechanism, this call can be made while holding a write lock on the MySQL database to get a transactionally consistent snapshot of the current WAL file position along with other non-RocksDB log/data files.
      
      Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765
      
      Differential Revision: D17172717
      
      Pulled By: affandar
      
      fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
      229e6fbe
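For comparison, the directory-scan alternative described in the summary above can be sketched roughly as follows. This is a minimal illustration, not RocksDB code: `DirEntry`, `ParseWalNumber`, and `FindLastWal` are hypothetical names, the listing is passed in as plain data rather than read from disk, and the sketch assumes WAL files are named `<number>.log`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record for one directory entry; not a RocksDB type.
struct DirEntry {
  std::string name;
  uint64_t size_bytes;
};

// Parses names of the form "<digits>.log" and stores the number in *number.
// Returns false for any other name.
bool ParseWalNumber(const std::string& name, uint64_t* number) {
  const std::string suffix = ".log";
  if (name.size() <= suffix.size() ||
      name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return false;
  }
  uint64_t n = 0;
  for (size_t i = 0; i + suffix.size() < name.size(); ++i) {
    if (name[i] < '0' || name[i] > '9') return false;
    n = n * 10 + static_cast<uint64_t>(name[i] - '0');
  }
  *number = n;
  return true;
}

// Picks the WAL with the highest file number from a directory listing.
// Returns false if the listing contains no WAL file.
bool FindLastWal(const std::vector<DirEntry>& listing, DirEntry* last) {
  bool found = false;
  uint64_t best = 0;
  for (const DirEntry& entry : listing) {
    uint64_t number = 0;
    if (!ParseWalNumber(entry.name, &number)) continue;
    if (!found || number > best) {
      best = number;
      *last = entry;
      found = true;
    }
  }
  return found;
}
```

The name and size returned this way would be recorded as the 'checkpoint' while the lock is held; the new API is meant to replace this scan with a single call.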
    • Replace named comparator struct with lambda (#5768) · 38b17ecd
      Yanqin Jin authored
      Summary:
      Tiny code mod: replace a named comparator struct with anonymous lambda.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5768
      
      Differential Revision: D17185141
      
      Pulled By: riversand963
      
      fbshipit-source-id: fabe367649931c33a39ad035dc707d2efc3ad5fc
      38b17ecd
  2. 04 Sep, 2019 1 commit
  3. 03 Sep, 2019 1 commit
    • Persistent globally unique DB ID in manifest (#5725) · 979fbdc6
      Vijay Nadimpalli authored
      Summary:
      Each DB has a globally unique ID. A DB can be physically copied around, or backed up and restored, and users should be able to identify it as the same DB. This unique ID is currently stored as plain text in the file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksummed; (2) the source of truth of a DB is the manifest file, which can be copied separately from the IDENTITY file, causing the DB ID to be wrong.
      The goal of this PR is to solve these problems by moving the DB ID to the manifest. To begin with, we will write to both the IDENTITY file and the manifest. Writing to the manifest is controlled by the flag write_dbid_to_manifest in Options, which defaults to false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
      
      Test Plan: Added unit tests.
      
      Differential Revision: D16963840
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
      979fbdc6
  4. 31 Aug, 2019 2 commits
    • Fix a bug in file ingestion (#5760) · 44eca41a
      Yanqin Jin authored
      Summary:
      Before this PR, when more than 2 column families were involved in a file ingestion, a bug in the looping logic prevented the correct file number from being assigned to each ingestion job.
      Also skip deleting non-existent hard links during cleanup after failure.
      
      Test plan (devserver)
      ```
      $ COMPILE_WITH_ASAN=1 make all
      $ ./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
      $ make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760
      
      Differential Revision: D17142982
      
      Pulled By: riversand963
      
      fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
      44eca41a
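To illustrate what "correct file number being assigned to each ingestion job" means, here is a minimal sketch of handing out non-overlapping file number ranges to several jobs. It is not the code from the PR: `AssignStartFileNumbers` is a hypothetical helper, and the assumption is simply that each job needs one consecutive file number per ingested file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the first file number for each ingestion job, given how many
// external files each job ingests. Job i starts where job i-1 ends, so the
// ranges never overlap regardless of how many jobs there are.
std::vector<uint64_t> AssignStartFileNumbers(
    uint64_t next_file_number, const std::vector<size_t>& files_per_job) {
  std::vector<uint64_t> starts;
  starts.reserve(files_per_job.size());
  uint64_t current = next_file_number;
  for (size_t count : files_per_job) {
    starts.push_back(current);
    current += count;  // advance past the numbers reserved for this job
  }
  return starts;
}
```

With three column families ingesting 2, 3, and 1 files starting from file number 100, the jobs would start at 100, 102, and 105.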
    • Fix assertion failure in FIFO compaction with TTL (#5754) · 672befea
      Yanqin Jin authored
      Summary:
      Before this PR, FIFO compaction with TTL could hit the assertion failure shown below.
      Stack trace (partial):
      ```
      (gdb) bt
      2  0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92
      3  0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101
      4  0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394
      5  0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241
      6  0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101
      7  0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201
      8  0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933
      9  0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541
      10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312
      11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087
      ```
      This can be caused by the following sequence of events.
      ```
      Time
      |      thr          bg_compact_thr1                     bg_compact_thr2
      |      write
      |      flush
      |                   mark all l0 as being compacted
      |      write
      |      flush
      |                   add cf to queue again
      |                                                       mark all l0 as being
      |                                                       compacted, fail the
      |                                                       assertion
      V
      ```
      Test plan (on devserver)
      Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same
      code, it is difficult to use sync point dependency to
      coordinate their execution. Therefore, I choose to use db_stress.
      ```
      $ TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0
      ```
      This should see no assertion failure.
      Last but not least,
      ```
      $ COMPILE_WITH_ASAN=1 make -j32 all
      $ make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754
      
      Differential Revision: D17109791
      
      Pulled By: riversand963
      
      fbshipit-source-id: 25fc46101235add158554e096540b72c324be078
      672befea
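The timeline above can be modelled in a few lines. The sketch below is a single-threaded simplification and is not the fix from the PR: `FileMeta` and `TryPickAllL0` are hypothetical names, and the guard shown (skip the pick when any L0 file is already marked) is only one assumed way to avoid tripping the `being_compacted` assertion quoted in the stack trace.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for per-file metadata; only the flag matters here.
struct FileMeta {
  bool being_compacted = false;
};

// Tries to pick every L0 file for a deletion compaction.
// Marking a file that is already marked is what the assertion in
// Compaction::MarkFilesBeingCompacted() rejects, so this sketch refuses
// to pick when any file is already being compacted.
bool TryPickAllL0(std::vector<FileMeta>& l0_files) {
  if (l0_files.empty()) return false;
  for (const FileMeta& f : l0_files) {
    if (f.being_compacted) return false;  // another compaction owns it
  }
  for (FileMeta& f : l0_files) {
    f.being_compacted = true;
  }
  return true;
}
```

In the timeline, the first background thread picks and marks all L0 files, a flush then adds a new L0 file, and the second thread sees a mix of marked and unmarked files; with a guard like this it would back off instead of asserting.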
  5. 30 Aug, 2019 5 commits
  6. 29 Aug, 2019 1 commit
    • Support row cache with batched MultiGet (#5706) · e1057033
      anand76 authored
      Summary:
      This PR adds support for row cache in `rocksdb::TableCache::MultiGet`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5706
      
      Test Plan:
      1. Unit tests in db_basic_test
      2. db_bench results with batch size of 2 (`Get` is faster than `MultiGet` for single key) -
      Get -
      readrandom   :       3.935 micros/op 254116 ops/sec;   28.1 MB/s (22870998 of 22870999 found)
      MultiGet -
      multireadrandom :       3.743 micros/op 267190 ops/sec; (24047998 of 24047998 found)
      
      Command used -
      TEST_TMPDIR=/dev/shm/multiget numactl -C 10  ./db_bench -use_existing_db=true -use_existing_keys=false -benchmarks="readtorowcache,[read|multiread]random" -write_buffer_size=16777216 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -row_cache_size=4194304000 -batch_size=2 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=131072
      
      Differential Revision: D17086297
      
      Pulled By: anand1976
      
      fbshipit-source-id: 85784378da913e05f1baf31ec1b4e7c9345e7f57
      e1057033
  7. 28 Aug, 2019 2 commits
  8. 27 Aug, 2019 3 commits
  9. 24 Aug, 2019 2 commits
    • Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Zhongyi Xie authored
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and the mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
      The old option actually works as both an upper bound and a lower bound. History trimming will start if the number of immutable memtables exceeds the limit, but trimming will never take it below (limit-1).
      In order to mimic this behavior with the new option, history trimming will stop if dropping the next immutable memtable would cause the total memory usage to go below the size limit. For example, assume the size limit is set to 64MB and there are 3 immutable memtables with sizes of 20MB, 30MB, and 30MB. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable would reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
      2f41ecfe
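The stopping rule in the summary above can be checked with a small sketch. It is a simplified model, not the RocksDB implementation: `TrimHistory` is a hypothetical function, sizes are plain integers in MB, the mutable memtable is left out, and the limit is assumed to be non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <numeric>

// Drops the oldest memtables (front of the deque) for as long as the usage
// that would remain afterwards is still at or above the limit.
// Returns the number of memtables dropped.
size_t TrimHistory(std::deque<uint64_t>& sizes_oldest_first, uint64_t limit) {
  uint64_t total = std::accumulate(sizes_oldest_first.begin(),
                                   sizes_oldest_first.end(), uint64_t{0});
  size_t dropped = 0;
  while (!sizes_oldest_first.empty() &&
         total - sizes_oldest_first.front() >= limit) {
    total -= sizes_oldest_first.front();
    sizes_oldest_first.pop_front();
    ++dropped;
  }
  return dropped;
}
```

For the example in the summary (20, 30, 30 with a 64MB limit), dropping the oldest memtable would leave 60MB, which is below the limit, so nothing is dropped. With a fourth 30MB memtable the total is 110MB, the oldest one is dropped (90MB remains), and trimming then stops because the next drop would leave 60MB.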
    • crc32c_arm64 performance optimization (#5675) · 26293c89
      DaiZhiwei authored
      Summary:
      Coding optimization for the parallel CRC32C computation:
      Macro unrolling removes the "for" loop, which reduces branch misses on the arm64 microarchitecture.
      Each 1024-byte block is divided into three parts: 8 (head) + 1008 (6 * 7 * 3 * 8) + 8 (tail).
      The 42 loop iterations are unrolled into 6 CRC32C7X24BYTES macros.
      Each CRC32C7X24BYTES contains 7 CRC32C24BYTES macros.
      
      1. crc32c_test
      [==========] Running 4 tests from 1 test case.
      [----------] Global test environment set-up.
      [----------] 4 tests from CRC
      [ RUN      ] CRC.StandardResults
      [       OK ] CRC.StandardResults (1 ms)
      [ RUN      ] CRC.Values
      [       OK ] CRC.Values (0 ms)
      [ RUN      ] CRC.Extend
      [       OK ] CRC.Extend (0 ms)
      [ RUN      ] CRC.Mask
      [       OK ] CRC.Mask (0 ms)
      [----------] 4 tests from CRC (1 ms total)
      
      [----------] Global test environment tear-down
      [==========] 4 tests from 1 test case ran. (1 ms total)
      [  PASSED  ] 4 tests.
      
      2. db_bench --benchmarks="crc32c"
      crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op)
      
      3. crc32c_test case repeated 60000 times
      perf stat -e branch-miss -- ./crc32c_test
      before optimization:
      739,426,504      branch-miss
      after optimization:
      1,128,572      branch-miss
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675
      
      Differential Revision: D16989210
      
      fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557
      26293c89
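The block layout quoted in the summary can be verified at compile time. The constants below are hypothetical names chosen for this sketch; the reading that each CRC32C24BYTES step covers 3 * 8 = 24 bytes is inferred from the macro name and the "6 * 7 * 3 * 8" breakdown, not taken from the patch itself.

```cpp
#include <cassert>
#include <cstddef>

// Layout of one 1024-byte block as described in the commit summary.
constexpr size_t kHeadBytes = 8;
constexpr size_t kTailBytes = 8;
constexpr size_t kOuterMacros = 6;    // CRC32C7X24BYTES expansions per block
constexpr size_t kInnerMacros = 7;    // CRC32C24BYTES per CRC32C7X24BYTES
constexpr size_t kBytesPerStep = 3 * 8;  // assumed: 24 bytes per inner macro

constexpr size_t kUnrolledSteps = kOuterMacros * kInnerMacros;
constexpr size_t kBodyBytes = kUnrolledSteps * kBytesPerStep;
constexpr size_t kBlockBytes = kHeadBytes + kBodyBytes + kTailBytes;

static_assert(kUnrolledSteps == 42, "42 loop iterations are unrolled");
static_assert(kBodyBytes == 1008, "body is 6 * 7 * 3 * 8 bytes");
static_assert(kBlockBytes == 1024, "head + body + tail covers the block");
```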
  10. 23 Aug, 2019 3 commits
    • Revert to storing UncompressionDicts in the cache (#5645) · df8c307d
      Levi Tamasi authored
      Summary:
      PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point of the digest is to create it once and reuse it over and over again. This patch goes back to storing the uncompression dictionary itself in the cache (which should now be safe to do, since it no longer includes a Statistics pointer), while preserving the rest of the refactoring.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645
      
      Test Plan: make asan_check
      
      Differential Revision: D16551864
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
      df8c307d
    • Atomic Flush Crash Test also covers the case that WAL is enabled. (#5729) · d8a27d93
      sdong authored
      Summary:
      AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal=false. All the workload generation and validation can stay the same.
      The atomic flush crash test is also changed to switch between the two test scenarios. This makes the name "atomic flush crash test" out of sync with what it really does. We leave it as it is to avoid trouble with the continuous test set-up.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729
      
      Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed.
      
      Differential Revision: D16969791
      
      fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
      d8a27d93
    • Fix local includes · 202942b2
      Patrick Pei authored
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5722
      
      Differential Revision: D16908380
      
      fbshipit-source-id: 6a0e3cb2730b08d6012d3d7f31c937f01c399846
      202942b2
  11. 22 Aug, 2019 2 commits
    • Refactor MultiGet names in BlockBasedTable (#5726) · 244e6f20
      Maysam Yabandeh authored
      Summary:
      To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726
      
      Differential Revision: D16962535
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
      244e6f20
    • Fix MultiGet() bug when whole_key_filtering is disabled (#5665) · 9046bdc5
      anand76 authored
      Summary:
      The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and it was not calling the transform for keys in the domain. This PR fixes both problems by moving the domain check and the transformation into the FilterBlockReader.
      
      Tests:
      Unit test (confirmed failed before the fix)
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665
      
      Differential Revision: D16902380
      
      Pulled By: anand1976
      
      fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
      9046bdc5
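The two rules from the summary above (an out-of-domain key must not be ruled out by the prefix filter, and an in-domain key must be transformed before the lookup) can be sketched with a toy filter. Everything here is hypothetical and self-contained: `FixedPrefix` stands in for a prefix extractor, and `PrefixFilter` uses an exact set where a real bloom filter would be probabilistic.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical fixed-length prefix extractor.
struct FixedPrefix {
  size_t len;
  bool InDomain(const std::string& key) const { return key.size() >= len; }
  std::string Transform(const std::string& key) const {
    return key.substr(0, len);
  }
};

// Toy prefix filter. It owns the domain check and the transform, so callers
// pass whole keys and cannot forget either step.
class PrefixFilter {
 public:
  explicit PrefixFilter(FixedPrefix extractor) : extractor_(extractor) {}

  void Add(const std::string& key) {
    if (extractor_.InDomain(key)) {
      prefixes_.insert(extractor_.Transform(key));
    }
  }

  // Returns false only when the key is definitely absent.
  bool KeyMayMatch(const std::string& key) const {
    if (!extractor_.InDomain(key)) {
      return true;  // the filter has no information about this key
    }
    return prefixes_.count(extractor_.Transform(key)) > 0;
  }

 private:
  FixedPrefix extractor_;
  std::set<std::string> prefixes_;
};
```

Keeping both steps inside the filter object mirrors the direction described in the summary, where the check and transformation live in one place rather than in each caller.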
  12. 21 Aug, 2019 3 commits
  13. 20 Aug, 2019 1 commit
    • Slightly adjust atomic white box test's kill odd (#5717) · 8e12638f
      sdong authored
      Summary:
      The atomic white box test's kill odds are the same as the normal test's. However, in the scenario where only WritableFileWriter::Append() is blacklisted, WritableFileWriter::Flush() dominates the kill odds. Normally, most WritableFileWriter::Flush() calls happen in WAL writes, where every write triggers a WAL flush. In the atomic test, WAL is disabled, so the kill happens less frequently than we anticipated. In some rare cases, the kill never happened (for reasons I still don't fully understand), causing the stress test to time out.
      
      If WAL is disabled, make the kill 5x more likely to trigger.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5717
      
      Test Plan: Run whitebox_crash_test_with_atomic_flush and whitebox_crash_test and observe the kill odds printed out.
      
      Differential Revision: D16897237
      
      fbshipit-source-id: cbf5d96f6fc0e980523d0f1f94bf4e72cdb82d1c
      8e12638f
  14. 17 Aug, 2019 11 commits
  15. 16 Aug, 2019 1 commit
    • Add command "list_file_range_deletes" in ldb (#5615) · bd2c753d
      sdong authored
      Summary:
      Add a command in ldb so that users can print out tombstones in SST files.
      In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't exit the program but returns the status code.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
      
      Test Plan: Add a new unit test
      
      Differential Revision: D16550326
      
      fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
      bd2c753d
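The testability point in the summary above is a general pattern: a runner that returns its status code can be called from a unit test, while one that exits the process cannot. The sketch below shows the shape of that pattern only; `RunToolCommand` and its command names are hypothetical and do not reproduce the ldb interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical command runner that reports its result as a return value
// instead of terminating the process.
int RunToolCommand(const std::vector<std::string>& args) {
  if (args.empty()) {
    return 1;  // usage error: no command given
  }
  if (args[0] == "noop") {
    return 0;  // recognized command, nothing to do in this sketch
  }
  return 1;  // unknown command
}
```

A real `main()` would then simply return the result of the runner, so the command-line behavior stays the same while tests can assert on the code directly.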