Core concepts¶
SQLTrack currently supports PostgreSQL through the psycopg3 driver, as well as SQLite for local use. We decided against using an ORM like SQLAlchemy, as it would add layers of indirection between our users and their data. Ideally we would use standard SQL and let users bring their own Python DB-API 2.0 compatible driver, but that would mean losing access to advanced features like indexable JSONB columns.
Experiments¶
Experiments are the top-level concept in SQLTrack. They are stored in the experiments table. It has the following columns:
- unique ID
- unique name
- free text comment
- tags, stored as a dict
- extra bits of data that are not tags
Tags are stored as a dict, where keys are the text tags and all values are True. This definition makes it easy to query, add, and remove tags. You can store anything that isn’t a tag as an extra, though it must be JSON serializable.
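With the PostgreSQL engine this dict maps directly onto JSONB operators. The following is a rough sketch of what tag queries and updates can look like; it assumes the tag column is called tags and the ID column id, so check the full base schema for the actual definitions:
-- find experiments carrying the tag 'baseline' (JSONB key existence test)
SELECT * FROM experiments WHERE tags ? 'baseline';
-- add a tag by merging another key with value true into the dict
UPDATE experiments SET tags = tags || '{"baseline": true}'::jsonb WHERE id = 1;
-- remove a tag by deleting its key
UPDATE experiments SET tags = tags - 'baseline' WHERE id = 1;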
Experiment links¶
SQLTrack supports adding links between experiments, e.g., to express predecessor/successor relationships. Experiment links are stored in the experiment_links table. It has the following columns:
- ID of the origin experiment
- what kind of link this is
- ID of the target experiment
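As an illustration, recording that one experiment succeeds another could look roughly like the following sketch; the column names from_id, kind, and to_id are assumptions made here for readability, not necessarily the names used in the base schema:
-- hypothetical: mark experiment 2 as the successor of experiment 1
INSERT INTO experiment_links (from_id, kind, to_id) VALUES (1, 'successor', 2);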
Runs¶
Runs are individual, tracked executions of some code, like a job on an HPC cluster. Each run belongs to one experiment and stores metadata about the execution. Runs are stored in the runs table. It has the following columns:
- unique ID
- ID of the experiment this run belongs to
- text status of the run, with a default value
- creation timestamp, must not be null
- start timestamp
- last update timestamp
- free text comment
- tags, stored as a dict
- arguments for this run
- environment variables for this run
- extra bits of data that don’t fit anywhere else
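Because runs live in an ordinary table, inspecting them is a matter of plain SQL. The query below is only a sketch; the column names experiment_id and status, as well as the status value 'failed', are assumptions for illustration:
-- hypothetical: list all failed runs of experiment 1
SELECT * FROM runs WHERE experiment_id = 1 AND status = 'failed';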
Run links¶
Like links between experiments, SQLTrack supports adding links between runs. Common uses could be to express that one run resumes or repeats another run that failed, or a pre-training/fine-tuning relationship. Run links are stored in the run_links table. It has the following columns:
- ID of the origin run
- what kind of link this is
- ID of the target run
Metrics¶
Metrics are measurements taken during run execution. They are stored in the metrics table. It has the following columns:
- ID of the run these metrics belong to
- integer step of the experiment
- run progress expressed as a real number
Importantly, metrics measured at the same point in time are stored together in the same row. You must set step or progress (or both) to define this point. By convention, step refers to an absolute value like iterations, and progress refers to a completion percentage.
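For example, the loss and top-1 accuracy measured after the same training step would share one row. The insert below is a sketch that assumes a run ID column named run_id and metric columns like those defined in the next section:
-- hypothetical: store two metrics measured at step 1000 of run 42 in one row
INSERT INTO metrics (run_id, step, train_loss, train_top1) VALUES (42, 1000, 0.374, 0.881);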
Defining metrics¶
After initializing the database with the base schema, the metrics table doesn’t contain any columns to store metrics yet. We recommend that users add the required columns before starting experiments; however, run_add_metrics() (and Run.add_metrics()) will try to infer types and attempt to add columns if necessary. To manually define metrics, create a script to add columns. A script that defines columns for timing, loss, and accuracy in train, validation, and test phases could look like this:
BEGIN;
ALTER TABLE metrics ADD COLUMN train_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN train_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN train_loss FLOAT;
ALTER TABLE metrics ADD COLUMN train_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN train_top5 FLOAT;
ALTER TABLE metrics ADD COLUMN val_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN val_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN val_loss FLOAT;
ALTER TABLE metrics ADD COLUMN val_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN val_top5 FLOAT;
ALTER TABLE metrics ADD COLUMN test_start TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN test_end TIMESTAMP WITH TIME ZONE;
ALTER TABLE metrics ADD COLUMN test_loss FLOAT;
ALTER TABLE metrics ADD COLUMN test_top1 FLOAT;
ALTER TABLE metrics ADD COLUMN test_top5 FLOAT;
END;
Call this script v001.sql or similar, and use it with sqltrack setup:
sqltrack setup v001.sql
If you want to make changes later, simply create a v002.sql and run setup again.
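For example, a v002.sql that adds one more (hypothetical) metric column could look like this:
BEGIN;
ALTER TABLE metrics ADD COLUMN test_top10 FLOAT;
END;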
Automatic types with PostgreSQL¶
With the PostgreSQL engine, SQLTrack follows the adaptation strategy of psycopg. This provides seamless translation for commonly used built-in types like int, float, str, etc.
Automatic types with SQLite¶
With SQLite, the following types are inferred:
- int: INTEGER
- float: REAL
- str: TEXT
- bytes: BLOB
- datetime.date: DATE
- datetime.datetime: TIMESTAMP WITH TIME ZONE
- datetime.timedelta: INTERVAL
- Jsonb: JSONB
Is one column per metric not inefficient?¶
You might ask why you should use one column per metric, since that seems annoying and wasteful compared to a normalized name+value approach like the one MLflow uses (one row per value, with run ID, metric name, and timestamp). But don’t worry: PostgreSQL is smart and doesn’t actually store NULL values. It only stores values that are not NULL and uses a per-row bitmap to keep track of which columns are set. On the other hand, each row has a fixed-size header of roughly 23 bytes, and MLflow uses one row per metric value. Since we store many metric values in one row, we can afford really large bitmaps to track those NULL values before we come out worse. Even if you only ever store one metric per row, the break-even point is over 200 columns.
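As a rough back-of-the-envelope check: the NULL bitmap costs one bit per column, so roughly 200 unused columns add only about 25 bytes to a row, which is about what the name+value approach pays again in row header alone for every additional metric value, before even counting the repeated run ID, metric name, and timestamp.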
Full schema¶
If you wish to dive deeper into the inner workings of SQLTrack, you can find the full base schema for PostgreSQL at sqltrack/engines/postgres/base.sql, and for SQLite at sqltrack/engines/sqlite/base.sql. They contain the definitions of the corresponding tables experiments, experiment_links, runs, run_links, and metrics, as well as indices and some additional definitions.