SSTables and LSM Trees

The problem with log-structured storage segments is keeping the hash index in memory. For a large dataset, that is a difficult task.

  • The solution is to reduce the size of the index map.
  • SSTables come with a neat solution.

SSTable : Sorted String Table

  • Keys are kept in sorted order.
  • Each key appears only once within each merged segment file (the compaction process already ensures this).
  • SSTables have several big advantages over log segments with hash indexes.

SSTable Segments:

  • SSTable segments are easy to merge, since the keys are sorted.
  • The approach is similar to merge sort algorithm:
    • You start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. This produces a new merged segment file, also sorted by key.
    • When multiple segments contain the same key, we keep the value from the most recent segment and discard the values in the older segments.
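The merge step above can be sketched in Python (a toy sketch, assuming each segment is an in-memory list of (key, value) pairs sorted by key, listed newest first; real segments are files on disk):

```python
import heapq

def merge_segments(*segments):
    """Merge sorted (key, value) segments, newest segment first.

    When the same key appears in several segments, the value from
    the most recent (first-listed) segment wins.
    """
    # Tag each entry with the segment's age so that, for equal keys,
    # the newest segment's entry comes out of the merge first.
    tagged = [
        [(key, age, value) for key, value in seg]
        for age, seg in enumerate(segments)
    ]
    merged = []
    last_key = None
    for key, age, value in heapq.merge(*tagged):
        if key != last_key:          # first (newest) occurrence wins
            merged.append((key, value))
            last_key = key
    return merged
```

Like the merge step of merge sort, this reads each input only once and never needs the whole dataset in memory at a time.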
  • Find a particular key in the file:
    • To find a particular key in the file, you no longer need to keep an index of all the keys in memory.
    • Say you are looking for the key "handiwork", but the offset for this key is not in the index. You do have the offsets for the keys "handbag" and "handsome", so you know that "handiwork" must appear between those two keys. This means you can jump to the offset for "handbag" and scan from there until you find "handiwork" (or not, if the key is not present in the segment).
    • You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient.
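A minimal sketch of this sparse-index lookup, assuming the segment is an in-memory list of sorted (key, value) pairs and the "offsets" are simply list indices:

```python
import bisect

def find(sorted_records, sparse_index, key):
    """Look up `key` using a sparse index.

    `sorted_records` stands in for the segment file: (key, value)
    pairs sorted by key.  `sparse_index` holds (key, offset) pairs
    for only some of the keys, also sorted by key.
    """
    indexed_keys = [k for k, _ in sparse_index]
    # Rightmost indexed key that is <= the key we are looking for.
    i = bisect.bisect_right(indexed_keys, key) - 1
    if i < 0:
        return None              # key sorts before the first indexed key
    _, offset = sparse_index[i]
    # Scan forward from that offset until we find the key or pass it.
    for k, v in sorted_records[offset:]:
        if k == key:
            return v
        if k > key:
            break
    return None
```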
  • Group the Segments into blocks:
    • Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
    • Each entry of the sparse in-memory index then points at the start of a compressed block.
    • Besides saving disk space, compression also reduces I/O bandwidth use.
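The block grouping could be sketched like this (a toy example using gzip and JSON for readability; real engines use binary formats and faster compressors):

```python
import gzip
import json

def write_blocks(sorted_records, block_size=2):
    """Group sorted (key, value) records into fixed-size blocks,
    compress each block, and build a sparse index mapping the first
    key of each block to its byte offset in the output."""
    out = bytearray()
    sparse_index = []
    for i in range(0, len(sorted_records), block_size):
        block = sorted_records[i:i + block_size]
        compressed = gzip.compress(json.dumps(block).encode())
        # The index entry points at the start of the compressed block.
        sparse_index.append((block[0][0], len(out)))
        out += compressed
    return bytes(out), sparse_index
```

A read then binary-searches the sparse index, decompresses one block, and scans it.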

Constructing and Maintaining SSTables:

How to maintain a sorted structure in memory:
  • We can use a self-balancing tree such as a red-black tree or an AVL tree.
  • With these data structures, you can insert keys in any order and read them back in sorted order.

Storage Engine works as follows: 

  • When a write request comes in, add it to an in-memory balanced tree data structure.
    • This in-memory tree is called memtable.
  • When the memtable gets bigger than some threshold (a few megabytes), write it out to disk as an SSTable file.
    • This can be done efficiently because the tree already maintains the keys in sorted order.
    • The new SSTable file becomes the most recent segment of the database.
    • While the SSTable is being written out to disk, write requests can continue to a new memtable instance.
  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next older segment.
  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
  • Problem: If the database crashes, the most recent writes on memtable are lost.
    • In order to avoid this problem, we can keep a separate log on disk to which every write is immediately appended. The log does not need to be in sorted order, because its only purpose is to restore the memtable after a crash.
    • Every time the memtable is written out to an SSTable, the corresponding log can be discarded. [Used in LevelDB, RocksDB, Cassandra, HBase]

LSM Tree: Log Structured Merge Tree:

  • Storage engines that are based on this principle of merging and compacting sorted files are often called LSM tree storage engines.
  • Lucene, an indexing engine for full text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary.
  • A full-text index is much more complex than a key-value index but is based on a similar idea: given a word in a search query, find all documents that mention the word.
  • This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
  • In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed.
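The term-to-postings-list idea can be sketched as follows (a toy in-memory version; Lucene's on-disk term dictionary is far more elaborate):

```python
from collections import defaultdict

def build_index(documents):
    """Build a term -> postings-list mapping from {doc_id: text}.
    Terms come out in sorted order, like an SSTable's keys."""
    postings = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    return dict(sorted(postings.items()))
```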

Performance Optimisations: 

  • The LSM tree algorithm can be slow when looking up keys that do not exist in the database.
  • You have to check the memtable, then the segments all the way back to the oldest, before you can be sure that the key does not exist.
  • In order to optimize this kind of search, storage engines often use an additional Bloom filter. A Bloom filter can tell you if a key definitely does not appear in the database, and thus saves many unnecessary disk reads for non-existent keys.
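A minimal Bloom filter sketch (illustrative only; real engines use tuned bit arrays and much faster hash functions than SHA-256):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may report false positives,
    never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive `hashes` bit positions from the key.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False means "definitely absent": the disk read can be skipped.
        return all(self.bits[pos] for pos in self._positions(key))
```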
  • There are different strategies to determine the order and timing of how SSTables are compacted and merged. The most common options are size-tiered and leveled compaction.
    • LevelDB and RocksDB use leveled compaction
    • HBase uses size-tiered
    • Cassandra supports both leveled and size-tiered
  • Size-Tiered compaction:
    • In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables.
  • Leveled compaction:
    • In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate "levels", which allows the compaction to proceed more incrementally and use less disk space.
  • The basic idea of the LSM tree, keeping a cascade of SSTables that are merged in the background, is simple and effective.
  • Even when the dataset is much bigger than the available memory, it continues to work well.
  • Since data is stored in sorted order, you can efficiently perform range queries (scanning all keys in a given range).
  • Also, because the disk writes are sequential, the LSM tree can support remarkably high write throughput.

Downsides of LSM Trees:

  • The compaction process can sometimes interfere with the performance of ongoing reads and writes.
  • Another issue with compaction arises at high write throughput: the disk's write bandwidth needs to be shared between the initial writes (logging and flushing the memtable to disk) and the compaction threads running in the background.
  • If compaction is not configured carefully, it can happen that compaction cannot keep up with the rate of incoming writes. In that case:
    • the number of unmerged segments on disk keeps growing until you run out of disk space, and
    • reads also slow down because they need to check more segment files.
  • So you need explicit monitoring to detect this situation.
