How we scaled SQL

Time-series workloads are different.

Time-series data is largely immutable. Writes primarily occur as new appends to recent time intervals, not as updates to existing rows. Both read and write workloads have a natural partitioning across both time and space.

TimescaleDB provides the illusion of a single continuous database table across all time and space, even though this table is split into many chunks across servers. One makes standard SQL queries against this hypertable, and the TimescaleDB distributed query planner automatically optimizes the query to access just the right set of chunks. A database can have multiple hypertables, each with its own schema and partitioning.

The database scales by partitioning hypertables in two dimensions: by time interval, and by a “partitioning key” over some primary index for the data (e.g., device identifiers for sensor data, locations, customers or users, etc.). Each time/space partition is called a chunk, which is created and placed on a server automatically by the system.

Chunks are dynamically created by the runtime and sized to optimize performance in both cluster and single-node environments. When run as a cluster, chunks are placed on different servers, particularly so that inserts to recent time intervals are parallelized across many servers. Similarly, query patterns often slice across time or space, so also enjoy performance improvements through smart chunk placement.

Even in single-node settings, chunking still improves performance over the vanilla use of standard database tables. Right-sized chunks ensure that the multiple B-trees for a table’s indexes can reside in memory during inserts to avoid thrashing. Further, they help avoid expensive “vacuuming” operations when removing deleted data according to automated retention policies, as the runtime can perform such operations by simple dropping chunks (internal tables), rather than deleting individual rows.

TimescaleDB integrates directly into the Postgres parse tree and query planner, and performs low-level distributed query optimizations to improve performance. These include minimizing the number of chunks accessed when answering queries, performing partial aggregations across many servers, optimizing queries with time-based GROUP BY, ORDER BY, and LIMIT clauses, and so forth.

Common queries in settings like device monitoring might ask questions like “give me the last reading for each device.” In many databases, this query would require a full table scan back to the beginning of time to capture “for every device.” Whenever a table schema defines a column as comprised of distinct items, the TimescaleDB runtime automatically tracks additional metadata about items to optimize such queries.

TimescaleDB allows JOINs between time-series and business data. This allows direct queries based on metadata stored in separate relational tables, for example, rather than requiring this metadata to either be denormalized at insert time (which is very expensive to both store and update), or by awkwardly performing such joins in application code between databases. TimescaleDB also supports (and uses) foreign data wrappers as a standard SQL interface to external databases.

TimescaleDB can be managed like a traditional Postgres database, and can leverage Postgres' rich ecosystem for tooling, backups and snapshotting, replication, log shipping, and other processes. TimescaleDB supports multi-tenant operation, with many distinct projects and users running in one deployment. It also supports highly flexible access control, providing natural ways to restrict users to different types of data (e.g., raw granular data, aggregates, derived), or different time intervals.