Getting started#
Currently SQLTrack supports PostgreSQL through the psycopg driver. We don’t plan on adding support forany other databases, except SQLite if there is demand for it. We’ve tried using ORMs, but found that they made things way more complicated than they needed to be and - most importantly - they obfuscated the DB schema from us. Ideally we would use standard SQL and let users bring their own Python DB-API 2.0 compatible driver, but that would mean we lose access to advanced features like indexable JSONB columns.
Installation#
SQLTrack can be installed like any other Python package,
e.g., pip install sqltrack
.
By default only core dependencies are installed,
which speeds up usage in containerized environments.
Core functionality located in the toplevel
sqltrack
package allows tracking experiments
and working with the database.
To use some of the convenience functions for anaylsis later,
install the full package with pip install sqltrack[full]
.
On Linux, your distribution repositories should include a version of PostgreSQL you can use. There are also install instructions for MacOS and Windows. We develop against PostgreSQL 13, but any currently supported version should work.
Database creation#
Todo
Creating a database:
CREATE DATABASE ${USERNAME};
Create a user:
CREATE USER ${USERNAME};
REVOKE CONNECT ON DATABASE ${USERNAME} FROM PUBLIC;
GRANT ALL PRIVILEGES ON DATABASE ${USERNAME} to ${USERNAME} WITH GRANT OPTION;
Create a schema:
CREATE SCHEMA IF NOT EXISTS ${SCHEMA};
Give a different user access to your database/schema:
GRANT CONNECT ON DATABASE ${DATABASE} TO ${USERNAME};
GRANT USAGE ON SCHEMA ${SCHEMA} TO ${USERNAME};
GRANT SELECT ON ALL TABLES IN SCHEMA ${SCHEMA} TO ${USERNAME};
Base schema#
This is the basic schema SQLTrack defines (minus some details like indexes),
with tables experiments
, experiment_links
, runs
,
run_links
, and metrics
.
runs.status
has the custom enum type runstatus
.
It behaves like text when used with the psycopg driver.
Possible values have been lifted from Slurm job status.
BEGIN;
CREATE TABLE applied_migrations (
name TEXT,
PRIMARY KEY(name)
);
CREATE TABLE experiments (
id BIGINT GENERATED BY DEFAULT AS IDENTITY,
time_created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
name TEXT NOT NULL,
comment TEXT,
tags JSONB,
PRIMARY KEY (id),
UNIQUE(name)
);
CREATE INDEX experiments_tags_gin ON experiments USING GIN (tags);
CREATE INDEX experiments_tags_gin_path ON experiments USING GIN (tags jsonb_path_ops);
CREATE TABLE experiment_links (
from_id BIGINT NOT NULL,
kind TEXT NOT NULL,
to_id BIGINT NOT NULL,
PRIMARY KEY(from_id, kind, to_id),
FOREIGN KEY(from_id) REFERENCES experiments(id),
FOREIGN KEY(to_id) REFERENCES experiments(id)
);
CREATE TYPE runstatus AS ENUM (
'BOOT_FAIL',
'CANCELLED',
'CONFIGURING',
'COMPLETED',
'COMPLETING',
'DEADLINE',
'FAILED',
'NODE_FAIL',
'OUT_OF_MEMORY',
'PENDING',
'PREEMPTED',
'RESV_DEL_HOLD',
'REQUEUE_FED',
'REQUEUE_HOLD',
'REQUEUED',
'RESIZING',
'REVOKED',
'RUNNING',
'SIGNALING',
'SPECIAL_EXIT',
'STAGE_OUT',
'STOPPED',
'SUSPENDED',
'TIMEOUT'
);
CREATE TABLE runs (
id BIGINT GENERATED BY DEFAULT AS IDENTITY,
experiment_id BIGINT NOT NULL,
status runstatus NOT NULL DEFAULT 'PENDING',
time_created TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
time_started TIMESTAMP WITH TIME ZONE,
time_updated TIMESTAMP WITH TIME ZONE,
comment TEXT,
tags JSONB,
args JSONB,
env JSONB,
PRIMARY KEY(id),
FOREIGN KEY(experiment_id) REFERENCES experiments(id) ON DELETE CASCADE
);
CREATE INDEX runs_tags_gin ON runs USING GIN (tags);
CREATE INDEX runs_tags_gin_path ON runs USING GIN (tags jsonb_path_ops);
CREATE INDEX runs_args_gin ON runs USING GIN (args);
CREATE INDEX runs_args_gin_path ON runs USING GIN (args jsonb_path_ops);
CREATE INDEX runs_env_gin ON runs USING GIN (env);
CREATE INDEX runs_env_gin_path ON runs USING GIN (env jsonb_path_ops);
CREATE TABLE run_links (
from_id BIGINT NOT NULL,
kind TEXT NOT NULL,
to_id BIGINT NOT NULL,
PRIMARY KEY(from_id, kind, to_id),
FOREIGN KEY(from_id) REFERENCES runs(id) ON DELETE CASCADE,
FOREIGN KEY(to_id) REFERENCES runs(id) ON DELETE CASCADE
);
CREATE TABLE metrics (
run_id INTEGER NOT NULL,
step BIGINT NOT NULL DEFAULT 0,
progress DOUBLE PRECISION NULL DEFAULT 0.0,
PRIMARY KEY (run_id, step, progress),
FOREIGN KEY(run_id) REFERENCES runs(id) ON DELETE CASCADE
);
END;
Defining metrics#
Note that the metrics
table doesn’t contain any columns to store metrics yet.
Users need to add these as required.
E.g., a script to add columns for timing, loss, and accuracy in
train, validation, and test phases could look like this:
BEGIN;
ALTER TABLE metrics
ADD COLUMN train_start TIMESTAMP WITH TIME ZONE,
ADD COLUMN train_end TIMESTAMP WITH TIME ZONE,
ADD COLUMN train_loss FLOAT,
ADD COLUMN train_top1 FLOAT,
ADD COLUMN train_top5 FLOAT,
ADD COLUMN val_start TIMESTAMP WITH TIME ZONE,
ADD COLUMN val_end TIMESTAMP WITH TIME ZONE,
ADD COLUMN val_loss FLOAT,
ADD COLUMN val_top1 FLOAT,
ADD COLUMN val_top5 FLOAT,
ADD COLUMN test_start TIMESTAMP WITH TIME ZONE,
ADD COLUMN test_end TIMESTAMP WITH TIME ZONE,
ADD COLUMN test_loss FLOAT,
ADD COLUMN test_top1 FLOAT,
ADD COLUMN test_top5 FLOAT;
END;
Now you might ask why we make you add columns for your metrics,
because that might seem annoying and wasteful compared
to a normalized name+value approach like what mlflow
uses
(one row per value with run ID, metric name, and timestamp).
But don’t worry, because PostgreSQL is smart and doesn’t
actually store NULL
values.
It only stores values that are not NULL
and uses a bitmap
per row to keep track of them.
Also, each row has a fixed size header of ~23 bytes and mlflow
uses one row per metric value.
Since we store many metric values in a row we can afford
really large bitmaps to track those NULL
values before we
come out worse.
Put your instructions to add metrics columns etc. in a SQL
script file, e.g. v001.sql
, for use later.
Add v002.sql
etc. to update your schema.
Setup the database#
SQLTrack provides a simple tool to setup your database.
$ sqltrack -h
usage: sqltrack [-h] [-u USER] [-a HOST] [-d DATABASE] [-s SCHEMA]
[-c CONFIG_PATH]
{setup} ...
positional arguments:
{setup} Available commands.
setup Setup (and update) the database.
options:
-h, --help show this help message and exit
-u USER, --user USER username
-a HOST, --host HOST DB host (and port)
-d DATABASE, --database DATABASE
database name
-s SCHEMA, --schema SCHEMA
schema name
-c CONFIG_PATH, --config-path CONFIG_PATH
path to config file
User, host, database, and schema as parameters given on the command line take priority,
but you can also define environment variables SQLTRACK_DSN_<PARAM>
to set them.
More info on available parameters can be found
here.
Finally, most convenient is probably to store them in a config file.
The default path is ./sqltrack.conf
user=<USER>
host=<HOST>
database=<DATABASE>
schema=<SCHEMA>
Those SQL script files you created earlier?
This is where you use them.
Run the setup command with them, e.g. sqltrack setup v001.sql
.
This creates the base schema and updates it with your definitions.
Track an experiment#
from random import random
import sqltrack
def main():
client = sqltrack.Client()
experiment = sqltrack.Experiment(client, name="Very science, much data")
run = experiment.get_run()
with run.track():
for epoch in range(90):
metrics = {"train_loss": random(), "train_top1": random()}
run.add_metrics(step=epoch, progress=epoch/epochs, **metrics)
Analyzing results#
This is where it’s up to you. We recommend Jupyter Lab to interact with the database, but plain Jupyter or alternatives like Plotly Dash <https://dash.plotly.com/introduction> work well too. Look at the examples directory in our repository to get some ideas. But really, you’re the experimenter, you know best what to do with your data.
To find out how SQLTrack can help you create tools that are exactly right for you, head on over to our guide on how to analyze experiments.