Why we made SQLTrack
Alternative title: “a rant about experiment tracking”.
For some reason tracking experiments is still hard today. It shouldn’t be. Here are our thoughts on the topic. Just bullet points for now, since we can’t be bothered to ask ChatGPT to write it for us.
How to keep track of your experiment results?
Experiment data is precious
Structure is necessary to compare experiment runs
Ad-hoc methods (spreadsheets et al.) can work well, but become cumbersome for hundreds/thousands of runs
Need automated solutions
Are hosted tracking services a good idea?
Many solutions (like wandb) to choose from
Typically a free-tier for individuals, but very costly for teams
Service can go down, or become slow (will happen right before a paper deadline)
Service provider can change conditions, effectively hold data hostage (e.g., introducing a monthly tracking time limit for free users)
Self-hosting is a must!
What self-hosted solutions are there?
Basically mlflow
Basic use cases are easy enough
Terse, but OK documentation
Comes with a first-party web GUI
Alternatives
-
OG tracking & reproducibility tool
Cool features like code versioning and automatic parameter parsing
GUI frontends are available, but none of them do what we need
-
Inspiration for our solution, but lots of limitations
Flat hierarchy with just experiments, no runs
Only one set of metrics per experiment
Relies on magic to detect metrics from global scope
-
Similar to our solution
DB schema with one row per experiment and deeply nested JSON columns
Every time a metric is added the whole row needs to be rewritten
Should be a performance nightmare
-
Are we missing something? Let us know
Our issues with mlflow
No concept of authentication, users, permissions, … you need to do everything yourself
By default, all tracked parameters & environment are displayed in the GUI
Need to select relevant columns
URL is used to store settings, including selected columns
Selection stops working if you have too many columns, because URL is too long
Cannot change the order of columns in tables
Run overview always shows the latest metric value
Schema makes aggregation over metric tables slow
A separate table with the latest value per run is used as a workaround
Other aggregations could be done similarly, but it is difficult to add them and this doesn’t scale
Runs cannot have relationships to other runs
E.g., linking a fine-tuning run to the pre-training run it builds on
Graphs are too small
So much clicking, let us program our analyses already!
Our solution: just use SQL!
SQL is almost 50 years old and still relevant, so it has to have done something right
Mature ecosystem with great tools and tons of great resources to learn
A lot of people know it already
Fine-grained user privilege controls down to single tables
Experiment data is not actually that complex and is easy to map to relational DBs, especially with modern features like JSON columns
You know your experiments, so just define your metrics as columns and avoid mlflow’s performance problems (see the sketch after this list)
Build whichever analyses you like, display them wherever, e.g. as reports with notes in Jupyter Notebooks
Trivial conversion from SQL query results to a Pandas DataFrame
SQL + Pandas + Jupyter = insane flexibility FOR FREE
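To make this concrete, here is a minimal sketch of the idea. It is not SQLTrack’s actual schema or API: it uses an in-memory SQLite database and pandas so it runs as-is, and the table names, metric columns, and values are invented for illustration. A real setup would likely point at PostgreSQL or whatever database you prefer.

import sqlite3
import pandas as pd

# In practice this would be a file or a server connection;
# an in-memory SQLite DB keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE experiments (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    comment TEXT
);
CREATE TABLE runs (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments(id),
    parent_run_id INTEGER REFERENCES runs(id), -- e.g. a fine-tuning run pointing at its pre-training run
    params        TEXT,                        -- JSON blob for hyperparameters
    val_loss      REAL,                        -- metrics are plain columns ...
    val_accuracy  REAL                         -- ... so aggregating them is plain SQL
);
""")

# Logging a run is just an INSERT, no magic involved.
con.execute("INSERT INTO experiments (id, name) VALUES (1, 'baseline')")
con.execute(
    "INSERT INTO runs (experiment_id, params, val_loss, val_accuracy) VALUES (?, ?, ?, ?)",
    (1, '{"lr": 3e-4, "batch_size": 64}', 0.42, 0.87),
)
con.commit()

# Analysis is one query away; pandas turns the result into a DataFrame,
# ready for plots or a written-up report in a Jupyter notebook.
df = pd.read_sql(
    """
    SELECT e.name, COUNT(*) AS num_runs, MAX(r.val_accuracy) AS best_accuracy
    FROM runs AS r JOIN experiments AS e ON e.id = r.experiment_id
    GROUP BY e.name
    """,
    con,
)
print(df)

From here the DataFrame can be plotted, styled, or dropped into a notebook report, and adding a new aggregation is just another query rather than a GUI feature request.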