Prometheus >= 2.0 uses a new storage engine which dramatically increases scalability
Ingested samples are grouped in blocks of two hours
Those 2h samples are stored in separate directories (in the data directory of Prometheus)
Writes are batched and written to disk in chunks, containing multiple data points
Every directory also has an index file (index) and a metadata file (meta.json)
It stores the metric names and the labels, and provides an index form the metric names and labels to the series in the chunk files
The most recent data is kept in memory
You don't want to loose the in-memory data during a crash, so the data also needs to be persisted to disk. This is done using a write-ahead-log (WAL)
Write Ahead Log (WAL)
it's quicker to append to a file (like a log) than making(multiple) random read/writes
If there's a server crash and the data from memory is lost, then the WAL will be replayed
This way no data will to lost or corrupted during a crash
When series gets deleted, a tombstone file gets create
The initial 2-hour blocks are merged in the background to from longer blocks
This is called compaction
The horizontal partitioning gives a lot of benefits:
When querying, the blocks not in the time range can be skipped
When completing a block, data only needs to be added, and not modified (avoids write-amplification)
Recent data is kept in memory, so can be queried quicker
Deleting old data is only a matter of deleting directories on the filesystem.
Compaction:
When querying, blocks have to be merged together to be able to calculate the results
Too many blocks could cause too much merging overhead, so blocks are compacted
2 blocks are merged and form a newly created (often larger) block
Compaction can also modify data:
dropping deleted data or restructuring the chunks to increase the query performance
The index:
Having horizontal partitioning already makes most queries quicker, but not those that need to go through all the data to get the result
The index is an inverted index to provide better query performance, also in cases where all data needs to be queried
Each series is assigned a unique ID (e.g. ID 1,2 and 3)
The index will contain an inverted index for the labels, for example for label env=production, it'll have 1 and 3 as IDs if those series contain the label env=production
What about Disk size?
On average, Prometheus needs 1-2 bytes per sample
You can use the following formula to calculate the disk space needed: