1. 02 May, 2020 1 commit
  2. 01 May, 2020 1 commit
    • Make users explicitly be aware of prepare before commit (#6775) · ef0c3eda
      Cheng Chang authored
      In the current commit protocol of pessimistic transactions, if the transaction is not prepared before commit, the commit protocol implicitly assumes that the user wants to commit without prepare.
      This PR adds TransactionOptions::skip_prepare. The default value is `true`, because if it were `false`, all existing users who commit without prepare would need to update their code to set skip_prepare to true. Although this default does not force users to explicitly express their intention to skip prepare, it at least makes them aware of the assumption that committing without prepare is allowed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6775
      Test Plan: added a new unit test TransactionTest::CommitWithoutPrepare
      Reviewed By: lth
      Differential Revision: D21313270
      Pulled By: cheng-chang
      fbshipit-source-id: 3d95b7c9b2d6cdddc09bdd66c561bc4fae8c3251
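The commit protocol above can be sketched in miniature. This is an illustrative model, not the real RocksDB headers: `Txn` and its methods are simplified stand-ins, and only the `skip_prepare` behavior described in the commit message is modeled.

```cpp
#include <cassert>

// Simplified stand-in for rocksdb::TransactionOptions (sketch, not the real
// header). Only the field relevant to this commit is modeled.
struct TransactionOptions {
  // Default is true so existing users who commit without prepare keep working.
  bool skip_prepare = true;
};

// Hypothetical minimal transaction modeling the prepare-before-commit check.
class Txn {
 public:
  explicit Txn(const TransactionOptions& opts) : opts_(opts) {}
  void Prepare() { prepared_ = true; }
  // Commit refuses to run without a prior Prepare unless skip_prepare is set,
  // making the "commit without prepare" assumption explicit.
  bool Commit() {
    if (!prepared_ && !opts_.skip_prepare) {
      return false;  // RocksDB would surface an error status here
    }
    return true;
  }

 private:
  TransactionOptions opts_;
  bool prepared_ = false;
};
```

With `skip_prepare = false`, a commit without a prior prepare is rejected instead of being silently treated as a one-phase commit.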
  3. 30 Apr, 2020 1 commit
  4. 07 Mar, 2020 1 commit
  5. 21 Feb, 2020 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong authored
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a flag that can be overridden at build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE overridden to another value.
      Differential Revision: D19977691
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
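The mechanism is a preprocessor default, sketched here in miniature (the function is invented for illustration; only the macro pattern mirrors the commit):

```cpp
#include <cassert>
#include <string>

// The namespace name is a macro that defaults to "rocksdb" but can be
// overridden at build time, e.g. with -DROCKSDB_NAMESPACE=myrocksdb_v6,
// so two builds of RocksDB can coexist in one binary.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
// Hypothetical function, just to show code referring to the macro namespace.
std::string WhoAmI() { return "rocksdb-under-a-configurable-namespace"; }
}  // namespace ROCKSDB_NAMESPACE
```

Code that refers to `ROCKSDB_NAMESPACE::...` (rather than `rocksdb::...`) keeps compiling no matter which name the build chose.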
  6. 08 Oct, 2019 1 commit
  7. 11 Jun, 2019 1 commit
    • WritePrepared: reduce prepared_mutex_ overhead (#5420) · c292dc85
      Maysam Yabandeh authored
      The patch reduces the contention over prepared_mutex_ using these techniques:
      1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
      2) Use two separate mutexes for PreparedHeap: one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, the two write queues will not compete with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
      3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420
      Differential Revision: D15741985
      Pulled By: maysamyabandeh
      fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
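The two-mutex split can be sketched as follows. This is an illustrative model, not RocksDB's actual PreparedHeap: the class and its eager pop-on-erase are simplified, but the ownership of the two mutexes mirrors the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <vector>

// Sketch: AddPrepared (first write queue) takes only push_pop_mutex_, while
// RemovePrepared (mostly the second queue) takes prepared_mutex_, so the two
// queues rarely contend on the same mutex.
class PreparedHeapSketch {
 public:
  void AddPrepared(uint64_t seq) {
    std::lock_guard<std::mutex> lk(push_pop_mutex_);
    heap_.push(seq);
  }
  void RemovePrepared(uint64_t seq) {
    std::lock_guard<std::mutex> lk(prepared_mutex_);
    erased_.insert(seq);
    // Popping erased entries off the top needs push_pop_mutex_ too; in the
    // real patch this extra acquisition is only occasionally necessary.
    std::lock_guard<std::mutex> lk2(push_pop_mutex_);
    while (!heap_.empty() && erased_.count(heap_.top())) {
      erased_.erase(heap_.top());
      heap_.pop();
    }
  }
  bool Empty() {
    std::lock_guard<std::mutex> lk(push_pop_mutex_);
    return heap_.empty();
  }

 private:
  std::mutex prepared_mutex_;  // taken by RemovePrepared
  std::mutex push_pop_mutex_;  // taken by AddPrepared and by pops
  std::priority_queue<uint64_t, std::vector<uint64_t>,
                      std::greater<uint64_t>> heap_;  // min-heap of seqs
  std::set<uint64_t> erased_;  // lazily erased entries not yet at the top
};
```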
  8. 01 Jun, 2019 1 commit
  9. 31 May, 2019 2 commits
  10. 13 Apr, 2019 1 commit
    • Remove extraneous call to TrackKey (#5173) · d655a3aa
      Manuel Ung authored
      In `PessimisticTransaction::TryLock`, we were calling `TrackKey` even when assume_tracked=true, which defeats the purpose of assume_tracked. Remove this.
      For keys that are already tracked, TrackKey will actually bump some counters (num_reads/num_writes) which are consumed in `TransactionBaseImpl::GetTrackedKeysSinceSavePoint`, and this is used to determine which keys were tracked since the last savepoint. I believe this functionality should still work, since I think the user should not call GetForUpdate/Put(assume_tracked=true) across savepoints, and if they do, they should not expect the Put(assume_tracked=true) to show up as a tracked key in the second savepoint.
      This is another 2-3% cpu improvement.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5173
      Differential Revision: D14883809
      Pulled By: lth
      fbshipit-source-id: 7d09f0772da422384af0519773e310c22b0cbca3
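The fix can be modeled with a toy transaction class. Everything here is illustrative (the class, method bodies, and counters are invented); only the control flow change matches the commit: when `assume_tracked` is true, the key is already tracked from a prior `GetForUpdate`, so `TrackKey` must not run again.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the relevant TryLock logic, not RocksDB's code.
class TxnSketch {
 public:
  void GetForUpdate(const std::string& key) {
    LockKey(key);
    TrackKey(key);
  }
  void Put(const std::string& key, bool assume_tracked) {
    if (!assume_tracked) {
      LockKey(key);
      TrackKey(key);  // before the fix, this ran even when assume_tracked=true
    }
  }
  int TrackCount(const std::string& key) const {
    auto it = track_count_.find(key);
    return it == track_count_.end() ? 0 : it->second;
  }

 private:
  void LockKey(const std::string&) { /* locking elided */ }
  void TrackKey(const std::string& key) { ++track_count_[key]; }
  std::map<std::string, int> track_count_;  // models num_reads/num_writes
};
```

The extra `TrackKey` call would have bumped the per-key counters a second time, which is what the savepoint bookkeeping described above consumes.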
  11. 03 Apr, 2019 1 commit
    • Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh authored
      In the prepare phase of 2PC, the db promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. A log that contains a prepare batch that is not yet committed is marked so that it is not garbage collected before the transaction commits or rolls back. The bug was that the write to the log file and the marking of the file were not atomic, so WAL GC could happen before the WAL log was actually marked. This patch moves the marking logic into PreReleaseCallback so that the WAL GC logic, which joins both write threads, sees the WAL write and the WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      Differential Revision: D14665210
      Pulled By: maysamyabandeh
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
  12. 15 Feb, 2019 1 commit
    • Apply modernize-use-override (2nd iteration) · ca89ac2b
      Michael Liu authored
      Use C++11’s override and remove virtual where applicable.
      Changes are automatically generated.
      Reviewed By: Orvid
      Differential Revision: D14090024
      fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
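The shape of the automated change, in miniature (the iterator classes are invented for illustration): in an overriding declaration, C++11's `override` replaces the redundant `virtual` and makes the compiler verify that the signature actually overrides a base-class method.

```cpp
#include <cassert>
#include <string>

struct Iterator {
  virtual ~Iterator() = default;
  virtual std::string Name() const { return "base"; }
};

// Before: `virtual std::string Name() const { ... }`
// After modernize-use-override: `virtual` dropped, `override` added. A typo
// in the signature would now be a compile error instead of a silent new
// virtual function.
struct MemTableIterator : Iterator {
  std::string Name() const override { return "memtable"; }
};
```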
  13. 16 Jan, 2019 1 commit
    • WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886) · cad99a60
      Maysam Yabandeh authored
      The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have a sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes:
      i) max is not advanced beyond the last published seq, except when the evicted commit entry itself is not yet published, which is quite rare.
      ii) When obtaining a snapshot, if max_evicted_seq_ is not published yet, commit a dummy entry so that the snapshot waits for it to be published, and also increase the latest published seq to one above the max.
      To test these unrealistic corner cases we create a commit cache of size 1 so that every single commit results in an eviction.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886
      Differential Revision: D13685270
      Pulled By: maysamyabandeh
      fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0
  14. 07 Dec, 2018 1 commit
    • Extend Transaction::GetForUpdate with do_validate (#4680) · b878f93c
      Maysam Yabandeh authored
      Transaction::GetForUpdate is extended with a do_validate parameter with a default value of true. If false, it skips validating the snapshot (if there is any) before doing the read, and the read returns the latest value (it expects ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly, ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with a default value of false. If true, it indicates that the call is assumed to come after a ::GetForUpdate(do_validate=false).
      The Java APIs are accordingly updated.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680
      Differential Revision: D13068508
      Pulled By: maysamyabandeh
      fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
  15. 25 Oct, 2018 1 commit
  16. 11 Sep, 2018 1 commit
    • Skip concurrency control during recovery of pessimistic txn (#4346) · 3f528226
      Maysam Yabandeh authored
      TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This can be used as an optimization if the application knows that the transaction will not have any conflicts with concurrent transactions. It is currently used during recovery, assuming that (i) the application guarantees no conflicts between prepared transactions in the WAL, and (ii) the application guarantees that recovered transactions will be rolled back or committed before new transactions start.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
      Differential Revision: D9759149
      Pulled By: maysamyabandeh
      fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
  17. 24 Jul, 2018 1 commit
    • WriteUnPrepared: Implement unprepared batches for transactions (#4104) · ea212e53
      Manuel Ung authored
      This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
      Support for Commit/Rollback of unprepared batches is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers, either when updating the prepare heap or when adding to the commit map. For updating the commit map, this logic lives inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
      A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
      Differential Revision: D8785717
      Pulled By: lth
      fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
  18. 04 May, 2018 1 commit
    • Skip deleted WALs during recovery · d5954929
      Siying Dong authored
      This patch records the min log number to keep in the manifest while flushing SST files, so that it and any WAL older than it are ignored during recovery. This is to avoid scenarios where there is a gap between the WAL files fed to the recovery procedure. The gap could happen, for example, due to out-of-order WAL deletion. Such a gap can cause problems in 2PC recovery, where the prepare and commit entries may be placed in two separate WALs; a gap in the WALs could result in not processing the WAL with the commit entry, breaking the 2PC recovery logic.
      Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in the memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flushed file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the new flushed SST file. This precomputed value is also remembered in memory and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not done yet anyway. Even if we did, the only thing we would lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file cleanup function is only executed after a flush or compaction.)
      This min log number to keep is stored in the manifest using the safely-ignorable custom field of the AddFile entry, in order to guarantee that a DB generated by a newer release can still be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      Differential Revision: D7747618
      Pulled By: siying
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
  19. 06 Feb, 2018 1 commit
    • WritePrepared Txn: Duplicate Keys, Txn Part · 88d8b2a2
      Maysam Yabandeh authored
      This patch takes advantage of the memtable being able to detect a duplicate <key,seq> and returning TryAgain, to handle duplicate keys in WritePrepared txns. Through WriteBatchWithIndex's index it detects the existence of at least one duplicate key in the write batch. If a duplicate key is reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch, and passes the count to DBImpl::Write. The DB makes use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch into the memtable, it increases the seq each time the memtable reports a duplicate (a sub-patch in our counting) and tries again.
      Closes https://github.com/facebook/rocksdb/pull/3455
      Differential Revision: D6873699
      Pulled By: maysamyabandeh
      fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
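The sub-patch counting can be sketched like this. This is an illustrative version, not RocksDB's actual duplicate detector: a batch is split into sub-patches at each point where a key repeats, because the memtable rejects duplicate <key,seq> pairs, so each sub-patch must get its own sequence number.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Count how many distinct sequence numbers a batch of keys needs when a
// duplicate key forces the start of a new sub-patch (hypothetical helper).
size_t CountSubBatches(const std::vector<std::string>& keys) {
  size_t sub_batches = keys.empty() ? 0 : 1;
  std::set<std::string> seen;  // keys in the current sub-patch
  for (const auto& key : keys) {
    if (!seen.insert(key).second) {
      // Duplicate within the current sub-patch: start a new one.
      ++sub_batches;
      seen.clear();
      seen.insert(key);
    }
  }
  return sub_batches;
}
```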
  20. 01 Dec, 2017 1 commit
    • WritePrepared Txn: PreReleaseCallback · 18dcf7f9
      Maysam Yabandeh authored
      Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePreparedTxn to i) update the commit map, and ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
      These changes ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default), the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
      Closes https://github.com/facebook/rocksdb/pull/3205
      Differential Revision: D6438959
      Pulled By: maysamyabandeh
      fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
  21. 12 Nov, 2017 1 commit
  22. 02 Nov, 2017 1 commit
    • WritePrepared Txn: Optimize for recoverable state · 17731a43
      Maysam Yabandeh authored
      GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store data that is needed only during recovery, so it does not need to be stored in the memtable right after each commit.
      This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
      Closes https://github.com/facebook/rocksdb/pull/3071
      Differential Revision: D6148023
      Pulled By: maysamyabandeh
      fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
  23. 03 Oct, 2017 1 commit
    • WritePrepared Txn: Rollback · d27258d3
      Maysam Yabandeh authored
      Implement the rollback of WritePrepared txns. For each modified value, it reads the value as of before the txn and writes it back. This cancels out the effect of the transaction. It also removes the rolled-back txn from the prepared heap.
      Closes https://github.com/facebook/rocksdb/pull/2946
      Differential Revision: D5937575
      Pulled By: maysamyabandeh
      fbshipit-source-id: a6d3c47f44db3729f44b287a80f97d08dc4e888d
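The rollback idea above can be sketched in a few lines. This is illustrative only (a `std::map` stands in for the DB, and deletion of keys that did not exist before the txn is elided), not the real WritePreparedTxn code:

```cpp
#include <cassert>
#include <map>
#include <string>

using KV = std::map<std::string, std::string>;

// For every key the txn modified, write back the value as of before the txn,
// canceling the txn's effect (hypothetical helper).
void RollbackByWritingBack(KV& db, const KV& pre_txn_values) {
  for (const auto& kv : pre_txn_values) {
    db[kv.first] = kv.second;  // a real impl also handles previously-absent keys
  }
}
```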
  24. 14 Sep, 2017 1 commit
    • WritePrepared Txn: Lock-free CommitMap · 09713a64
      Maysam Yabandeh authored
      We had two proposals for lock-free commit maps. This patch implements the latter, simpler one. We can later experiment with both proposals.
      In this impl each entry is an std::atomic of uint64_t, which are accessed via memory_order_acquire/release. In x86_64 arch this is compiled to simple reads and writes from memory.
      Closes https://github.com/facebook/rocksdb/pull/2861
      Differential Revision: D5800724
      Pulled By: maysamyabandeh
      fbshipit-source-id: 41abae9a4a5df050a8eb696c43de11c2770afdda
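The entry layout can be sketched as follows. This is a simplified model, not RocksDB's commit cache: the real implementation packs prepare and commit sequence numbers into each entry and handles evictions, which is elided here. Only the atomics-with-acquire/release scheme from the commit message is shown.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// A fixed-size array of std::atomic<uint64_t>, read with
// memory_order_acquire and written with memory_order_release. On x86_64
// these compile to plain loads and stores.
constexpr size_t kCacheSize = 1 << 4;  // tiny, for illustration
std::array<std::atomic<uint64_t>, kCacheSize> commit_cache{};

void AddCommitted(uint64_t prepare_seq, uint64_t commit_seq) {
  commit_cache[prepare_seq % kCacheSize].store(commit_seq,
                                               std::memory_order_release);
}

uint64_t GetCommitSeq(uint64_t prepare_seq) {
  return commit_cache[prepare_seq % kCacheSize].load(
      std::memory_order_acquire);
}
```

The release store paired with the acquire load guarantees that once a reader sees the commit seq, it also sees everything the committer wrote before publishing it.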
  25. 09 Sep, 2017 1 commit
  26. 17 Aug, 2017 1 commit
    • Update WritePrepared with the pseudo code · eb642530
      Maysam Yabandeh authored
      Implement the main body of the WritePrepared pseudocode. This includes PrepareInternal and CommitInternal, as well as AddCommitted, which updates the commit map. It also provides an IsInSnapshot method that can later be called from the read path to decide whether a version is in the read snapshot or should otherwise be skipped.
      This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is to have the API specified so that we can work on related tasks in parallel.
      Closes https://github.com/facebook/rocksdb/pull/2713
      Differential Revision: D5640021
      Pulled By: maysamyabandeh
      fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
  27. 08 Aug, 2017 1 commit
    • Refactor PessimisticTransaction · bdc056f8
      Maysam Yabandeh authored
      This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves the lock-related logic to PessimisticTransaction to be reused by all children classes, and moves the existing impl of the db-write-related logic to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
      Closes https://github.com/facebook/rocksdb/pull/2691
      Differential Revision: D5569464
      Pulled By: maysamyabandeh
      fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
  28. 06 Aug, 2017 1 commit
  29. 03 Aug, 2017 1 commit
    • Refactor TransactionImpl · c3d5c4d3
      Maysam Yabandeh authored
      This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added, which will later be completed to provide an alternative implementation.
      Closes https://github.com/facebook/rocksdb/pull/2676
      Differential Revision: D5549998
      Pulled By: maysamyabandeh
      fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
  30. 29 Jul, 2017 1 commit
    • Replace dynamic_cast<> · 21696ba5
      Siying Dong authored
      Replace dynamic_cast<> so that users can choose to build with RTTI off, saving several bytes per object and gaining a bit more available memory.
      Some nontrivial changes:
      1. Add Comparator::GetRootComparator() to get around the internal comparator hack
      2. Add the two experimental functions to DB
      3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
      4. Since 3 is done, move the option-parsing functions for table factory to the table factory files too, to be symmetric.
      Closes https://github.com/facebook/rocksdb/pull/2645
      Differential Revision: D5502723
      Pulled By: siying
      fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
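The pattern behind this change, in miniature (the classes and field are invented for illustration): where the concrete type is known by construction, `dynamic_cast<>`, which requires RTTI, can be replaced by `static_cast<>`, optionally paired with a debug-only check.

```cpp
#include <cassert>
#include <string>

struct TableFactory {
  virtual ~TableFactory() = default;
  virtual std::string Name() const = 0;
};

struct BlockBasedTableFactory : TableFactory {
  std::string Name() const override { return "BlockBasedTable"; }
  int block_size = 4096;
};

// The caller guarantees `factory` is a BlockBasedTableFactory, so no RTTI is
// needed; before the change this would have been a dynamic_cast.
int GetBlockSize(TableFactory* factory) {
  auto* bbtf = static_cast<BlockBasedTableFactory*>(factory);
  return bbtf->block_size;
}
```

The trade-off: the cast is now unchecked at runtime, so the invariant "this pointer really is that type" must be enforced by the code's structure (or asserted in debug builds).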
  31. 22 Jul, 2017 2 commits
  32. 16 Jul, 2017 1 commit
  33. 25 Jun, 2017 1 commit
    • Optimize for serial commits in 2PC · 499ebb3a
      Maysam Yabandeh authored
      Throughput: 46k tps in our sysbench settings (filling the details later)
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      Major design changes:
      1. The WAL file internal buffer is not flushed after each write. Instead
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes.
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers, which changes a major assumption in the code base. To
      accommodate that, i) allow only one WriteImpl that intends to write to
      the memtable, via mem_mutex_ (which is fine, since in 2PC almost all of
      the memtable writes come via the group commit phase, which is serial
      anyway); ii) make all the parts of the code base that assumed being the
      only writer (via EnterUnbatched) also acquire mem_mutex_; iii) stat
      updates are protected via a stat_mutex_.
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      Differential Revision: D5210732
      Pulled By: maysamyabandeh
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
  34. 28 Apr, 2017 1 commit
  35. 11 Apr, 2017 2 commits
    • Fix shared lock upgrades · 9300ef54
      Manuel Ung authored
      Upgrading a shared lock was silently succeeding because the actual locking code was skipped. This is because if the keys are tracked, it is assumed that they are already locked and do not require locking. Fix this by recording in tracked keys whether the key was locked exclusively or not.
      Note that lock downgrades are impossible, which is the behaviour we expect.
      This fixes facebook/mysql-5.6#587.
      Closes https://github.com/facebook/rocksdb/pull/2122
      Differential Revision: D4861489
      Pulled By: IslamAbdelRahman
      fbshipit-source-id: 58c7ebe7af098bf01b9774b666d3e9867747d8fd
    • Limit maximum memory used in the WriteBatch representation · 1f8b119e
      Manuel Ung authored
      Extend TransactionOptions to include max_write_batch_size, which determines the maximum size of the writebatch representation. If the memory limit is exceeded, the operation will abort with subcode kMemoryLimit.
      Closes https://github.com/facebook/rocksdb/pull/2124
      Differential Revision: D4861842
      Pulled By: lth
      fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161
  36. 06 Dec, 2016 1 commit
    • Implement non-exclusive locks · 2005c88a
      Manuel Ung authored
      This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (i.e. it is possible that a request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold a set of transaction ids instead of just one, and adding a flag specifying whether the lock is currently held with exclusive access or not.
      Some implementation notes:
      - Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
      - Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
      - Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
      Closes https://github.com/facebook/rocksdb/pull/1573
      Differential Revision: D4239097
      Pulled By: lth
      fbshipit-source-id: da7c074
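The changed lock record and its compatibility rule (including the shared-to-exclusive upgrade fixed in the commit above) can be sketched as follows. This is illustrative only: field and function names are invented, not RocksDB's actual `KeyLockInfo` API.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using TransactionID = uint64_t;

// Sketch of the changed KeyLockInfo: a set of holders instead of a single
// txn id, plus a flag for exclusive access.
struct KeyLockInfo {
  std::set<TransactionID> txn_ids;
  bool exclusive = false;
};

// A request is compatible if the lock is free, if the requester is the sole
// holder (the upgrade path), or if both the held lock and the request are
// shared.
bool CanAcquire(const KeyLockInfo& lock, TransactionID txn, bool exclusive) {
  if (lock.txn_ids.empty()) return true;
  if (lock.txn_ids.count(txn) && lock.txn_ids.size() == 1) return true;
  return !lock.exclusive && !exclusive;
}
```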
  37. 20 Oct, 2016 1 commit
    • Implement deadlock detection · 4edd39fd
      Manuel Ung authored
      Summary: Implement deadlock detection. This is done by maintaining a TxnID -> TxnID map which represents the edges in the wait-for graph (this is named `wait_txn_map_`).
      Test Plan: transaction_test
      Reviewers: IslamAbdelRahman, sdong
      Reviewed By: sdong
      Subscribers: andrewkr, dhruba
      Differential Revision: https://reviews.facebook.net/D64491
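The BFS-based check can be sketched like this. It is an illustrative version, not the exact RocksDB code (the map type and function are simplified): before `waiter` starts waiting on `waitee`, check whether `waiter` is reachable from `waitee` in the wait-for graph, visiting at most `depth_limit` transactions.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <set>

using TxnID = uint64_t;
// Edges of the wait-for graph: waiter -> set of txns it waits on.
using WaitTxnMap = std::map<TxnID, std::set<TxnID>>;

bool WouldDeadlock(const WaitTxnMap& wait_txn_map, TxnID waiter, TxnID waitee,
                   int depth_limit) {
  std::deque<TxnID> queue{waitee};
  std::set<TxnID> visited;
  int visited_count = 0;
  while (!queue.empty() && visited_count < depth_limit) {
    TxnID txn = queue.front();
    queue.pop_front();
    if (!visited.insert(txn).second) continue;  // already explored
    ++visited_count;
    if (txn == waiter) return true;  // waitee transitively waits on waiter
    auto it = wait_txn_map.find(txn);
    if (it == wait_txn_map.end()) continue;
    for (TxnID next : it->second) queue.push_back(next);
  }
  return false;  // no cycle found within the visit limit
}
```

Using a visit-count limit rather than a path-depth limit matches the later shared-lock change, where a transaction can wait on multiple others at once.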