Skip to content

LSM-based key-value store in Go for educational purpose.

License

Notifications You must be signed in to change notification settings

SarthakMakhija/go-lsm

Repository files navigation

go-lsm

Build

LSM-based key-value store in Go for educational purpose.

Rewrite of the existing workshop code.

Inspired by LSM in a Week.

Exploring LSM with go-lsm

  • Learn LSM from the ground up: Dive deep into the core concepts of Log-Structured Merge-Trees (LSM) through a practical, well-documented implementation.
  • Benefit from clean code: Analyze a meticulously crafted codebase that prioritizes simplicity and readability.
  • Gain confidence with robust tests: Verify the correctness and reliability of the storage engine through comprehensive tests.
  • Experiment and extend: Customize the code to explore different LSM variations or integrate it into your own projects.

Building blocks of LSM-based key-value storage engine

  1. Memtable is an in-memory data structure which holds versioned key and value pairs. Every transactional write gets stored in a Memtable which uses Skiplist as its data structure. The Skiplist implementation in this repository is shamelessly taken from Badger. It is a lock-free implementation of Skiplist. It is important to have a lock-free implementation, otherwise scan operation will take lock(s) (/read-locks) and it will start interfering with write operations. Check Memtable.

  2. WAL stands for write-ahead log. Every transactional write gets stored in a Memtable which is backed by a WAL. Every write to Memtable (typically a TimestampedBatch) involves writing every key/value pair from the batch to WAL. The implementation in this repository writes every key/value pair from the batch to WAL individually. An alternate would be to serialize the entire TimestampedBatch and write to WAL. Check WAL.

  3. Recovery of Memtable from WAL involves the following:

    1. Opening the WAL file in READONLY mode.
    2. Reading the whole file in one go.
    3. Iterating through the file buffer (/bytes) and decoding the bytes to get key and value pairs.
    4. Storing the key/value pairs in the Memtable.

    Check recovery of Memtable from WAL.

  4. Manifest records different events in the system. This implementation supports MemtableCreatedEventType, SSTableFlushedEventType and CompactionDoneEventType event types. This concept is used in recovering the state of the LSM (StorageState) when it restarts. Check Manifest.

  5. SSTable stands for sorted string table. It is the on-disk representation of the data. An SSTable contains the data sorted by key. SSTables can be created by flushing an immutable Memtable or by merging SSTables (/compaction). An SSTable needs to be encoded, the encoding of SSTable in this repository is available here. Check SSTable.

  6. Bloom filter is a probabilistic data structure used to test whether an element maybe present in the dataset. A bloom filter can query against large amounts of data and return either “possibly in the set” or “definitely not in the set”. It depends on M-sized bit vector and K-hash functions. It is used to check if the application should read an SSTable during a get operation. The Bloom filter acts as a first check for a key. If it says the key might be present (returns "maybe"), then the system checks the SSTable for confirmation. Check Bloom filter.

  7. Transaction represents an atomic unit of work. This repository implements various concepts to implement ACID properties:

A brief over of serialized-snapshot-isolation:

  1. Every transaction is given a begin-timestamp. Timestamp is represented as a logical clock.
  2. A transaction can read a key with a commit-timestamp < begin-timestamp. This guarantees that the transaction always reads committed data.
  3. When a transaction is ready to commit, and there are no conflicts, it is given a commit-timestamp.
  4. ReadWrite transactions keep a track of the keys read by them. Implementations like Badger keep track of key-hashes inside ReadWrite transactions.
  5. Two transactions conflict if there is a read-write conflict. A transaction T2 conflicts with another transaction T1, if, T1 has committed to any of the keys read by T2 with a commit-timestamp greater than the begin-timestamp of T2.
  6. Readonly transactions never abort.
  7. Serialized-snapshot-isolation prevents: dirty-read, fuzzy-read, phantom-read, write-skew and lost-update.

More details are available here. Start understanding Transaction.

  1. Compaction implementation in this repository is a simpled-leveled compaction. Simple-leveled compaction considers two options for deciding if compaction needs to run.

    • Option1: Level0FilesCompactionTrigger. This defines the number of SSTable files at level0 which should trigger compaction. Consider Level0FilesCompactionTrigger = 2, and number of SSTable files at level0 = 3. This means all SSTable files present at level0 are eligible for undergoing compaction with all the SSTable files at level1.

    • Option2: NumberOfSSTablesRatioPercentage. This defines the ratio between the number of SSTable files present in two adjacent levels: number of files at lower level / number of files at upper level. Consider NumberOfSSTablesRatioPercentage = 200, and number of SSTable files at level1 = 2, and at level2 = 1. Ratio = (1/2)*100 = 50%. This is less than the configured NumberOfSSTablesRatioPercentage. Hence, SSTable files will undergo compaction between level1 and level2. This typically means that the number of files in lower level(s) should be more than the number of files in upper level(s).

The actual implementation of simple-leveled compaction considers file size instead of number of files. Check Compaction.

  1. Iterators form one of the core building blocks of LSM based key/value storage engine. Iterators are used in operations like Scan and Compaction. This repository provides various iterators, (listing a few here): MergeIterator, SSTableIterator and InclusiveBoundedIterator.

  2. Client API provides a user interface for interacting with the key/value storage engine. It's important to note that the API itself isn't considered a fundamental building block of the engine. However, it functions as the primary access point for clients to perform various operations on the stored key/value data. Check Db.

Please note: this repository does not implement block-cache and CRC.

Development items

LSM development items

Releases

No releases published

Packages

No packages published

Languages