Core concepts¶
SQLTrack currently supports PostgreSQL through the psycopg3 driver, as well as SQLite for local use. We decided against using an ORM like SQLAlchemy, as it would add layers of indirection between our users and their data. Ideally we would use standard SQL and let users bring their own Python DB-API 2.0 compatible driver, but that would mean losing access to advanced features like indexable JSONB columns.
Experiments¶
Experiments are the top-level concept in SQLTrack. They are stored in the experiments table. It has the following columns:
- unique ID
- unique name
- free text comment
- tags, stored as a dict
- extra bits of data that are not tags
Tags are stored as a dict, where keys are the text tags and all values are True. This definition makes it easy to query, add, and remove tags. You can store anything that isn’t a tag as an extra, though it must be JSON serializable.
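With the PostgreSQL engine this dict maps directly onto JSONB operators. The following is a rough sketch of what tag queries and updates can look like; it assumes the tag column is called tags and the ID column id, so check the full base schema for the actual definitions:
-- find experiments carrying the tag 'baseline' (JSONB key existence test)
SELECT * FROM experiments WHERE tags ? 'baseline';
-- add a tag by merging another key with value true into the dict
UPDATE experiments SET tags = tags || '{"baseline": true}'::jsonb WHERE id = 1;
-- remove a tag by deleting its key
UPDATE experiments SET tags = tags - 'baseline' WHERE id = 1;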
Experiment links¶
SQLTrack supports adding links between experiments, e.g., to express predecessor/successor relationships. Experiment links are stored in the experiment_links table. It has the following columns:
- ID of the origin experiment
- what kind of link this is
- ID of the target experiment
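As an illustration, recording that one experiment succeeds another could look roughly like the following sketch; the column names from_id, kind, and to_id are assumptions made here for readability, not necessarily the names used in the base schema:
-- hypothetical: mark experiment 2 as the successor of experiment 1
INSERT INTO experiment_links (from_id, kind, to_id) VALUES (1, 'successor', 2);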
Runs¶
Runs are individual, tracked executions of some code, like a job on an HPC cluster. Each run belongs to one experiment and stores metadata about the execution. Runs are stored in the runs table. It has the following columns:
- unique ID
- ID of the experiment this run belongs to
- text status of the run, with a default value
- creation timestamp, must not be null
- start timestamp
- last update timestamp
- free text comment
- tags, stored as a dict
- arguments for this run
- environment variables for this run
- extra bits of data that don’t fit anywhere else
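Because runs live in an ordinary table, inspecting them is a matter of plain SQL. The query below is only a sketch; the column names experiment_id and status, as well as the status value 'failed', are assumptions for illustration:
-- hypothetical: list all failed runs of experiment 1
SELECT * FROM runs WHERE experiment_id = 1 AND status = 'failed';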
Run links¶
Like links between experiments, SQLTrack supports adding links between runs. Common uses could be to express that one run resumes or repeats another run that failed, or a pre-training/fine-tuning relationship. Run links are stored in the run_links table. It has the following columns:
- ID of the origin run
- what kind of link this is
- ID of the target run
Metrics¶
Metrics are measurements taken during run execution. They are stored in the metrics table. It has the following columns:
- ID of the run these metrics belong to
- integer step of the experiment
- run progress expressed as a real number
Importantly, metrics measured at the same point in time are stored together in the same row. You must set step or progress (or both) to define this point. By convention, step refers to an absolute value like iterations, and progress refers to a completion percentage.
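For example, the loss and top-1 accuracy measured after the same training step would share one row. The insert below is a sketch that assumes a run ID column named run_id and metric columns like those defined in the next section:
-- hypothetical: store two metrics measured at step 1000 of run 42 in one row
INSERT INTO metrics (run_id, step, train_loss, train_top1) VALUES (42, 1000, 0.374, 0.881);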
Defining metrics¶
After initializing the database with the base schema, the metrics table doesn’t contain any columns to store metrics yet. We recommend that users add the required columns before starting experiments; however, run_add_metrics() (and Run.add_metrics()) will try to infer types and attempt to add columns if necessary. To manually define metrics, create a script to add columns. A script that defines columns for timing, loss, and accuracy in train, validation, and test phases could look like this:
BEGIN;
ALTER TABLE metrics ADD COLUMN train_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN train_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN train_loss FLOAT;
ALTER TABLE metrics ADD COLUMN train_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN train_top5 FLOAT;
ALTER TABLE metrics ADD COLUMN val_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN val_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN val_loss FLOAT;
ALTER TABLE metrics ADD COLUMN val_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN val_top5 FLOAT;
ALTER TABLE metrics ADD COLUMN test_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN test_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN test_loss FLOAT;
ALTER TABLE metrics ADD COLUMN test_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN test_top5 FLOAT;
END;
Call this script v001.sql or similar, and use it with sqltrack setup:
sqltrack setup v001.sql
If you want to make changes later, simply create a v002.sql and run setup again.
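For example, a v002.sql that adds one more (hypothetical) metric column could look like this:
BEGIN;
ALTER TABLE metrics ADD COLUMN test_top10 FLOAT;
END;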
Automatic types with PostgreSQL¶
With the PostgreSQL engine, SQLTrack follows the adaptation strategy of psycopg. This provides seamless translation for commonly used built-in types like int, float, str, etc.
Automatic types with SQLite¶
With SQLite, the following types are inferred:
- int: INTEGER
- float: REAL
- str: TEXT
- bytes: BLOB
- datetime.date: DATE
- datetime.datetime: TIMESTAMP WITH TIME ZONE
- datetime.timedelta: INTERVAL
- Jsonb: JSONB
Is one column per metric not inefficient?¶
You might ask why you should use one column per metric, since that seems annoying and wasteful compared to a normalized name+value approach like the one MLflow uses (one row per value, with run ID, metric name, and timestamp). But don’t worry: PostgreSQL is smart and doesn’t actually store NULL values. It only stores values that are not NULL and uses a per-row bitmap to keep track of which columns are set. On the other hand, each row has a fixed-size header of roughly 23 bytes, and MLflow uses one row per metric value. Since we store many metric values in one row, we can afford really large bitmaps to track those NULL values before we come out worse. Even if you only ever store one metric per row, the break-even point is over 200 columns.
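As a rough back-of-the-envelope check: the NULL bitmap costs one bit per column, so roughly 200 unused columns add only about 25 bytes to a row, which is about what the name+value approach pays again in row header alone for every additional metric value, before even counting the repeated run ID, metric name, and timestamp.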
Full schema¶
If you wish to dive deeper into the inner workings of SQLTrack, you can find the full base schema for PostgreSQL at sqltrack/engines/postgres/base.sql, and for SQLite at sqltrack/engines/sqlite/base.sql. They contain the definitions of the corresponding tables experiments, experiment_links, runs, run_links, and metrics, as well as indices and some additional definitions.