Why we made SQLTrack
Alternative title: “a rant about experiment tracking”.
For some reason tracking experiments is still hard today. It shouldn’t be. Here are our thoughts on the topic. Just bullet points for now, since we can’t be bothered to ask ChatGPT to write it for us.
How to keep track of your experiment results?
Experiment data is precious
Structure is necessary to compare experiment runs
Ad-hoc methods (spreadsheets et al.) can work well, but become cumbersome for hundreds/thousands of runs
Need automated solutions
Are hosted tracking services a good idea?
Many solutions (like wandb) to choose from
Typically a free-tier for individuals, but very costly for teams
Service can go down, or become slow (will happen right before a paper deadline)
Service provider can change conditions, effectively hold data hostage (e.g., introducing a monthly tracking time limit for free users)
Self-hosting is a must!
What self-hosted solutions are there?
Basically mlflow
Basic use cases are easy enough
Terse, but OK documentation
Comes with a first-party web GUI
Alternatives
-
OG tracking & reproducibility tool
Cool features like code versioning and automatic parameter parsing
GUI frontends are available, but none of them do what we need
-
Inspiration for our solution, but lots of limitations
Flat hierarchy with just experiments, no runs
Only one set of metrics per experiment
Relies on magic to detect metrics from global scope
-
Similar to our solution
DB schema with one row per experiment and deeply nested JSON columns
Every time a metric is added the whole row needs to be rewritten
Should be a performance nightmare
-
Are we missing something? Let us know
Our issues with mlflow
No concept of authentication, users, permissions, … you need to do everything yourself
By default, all tracked parameters & environment are displayed in the GUI
Need to select relevant columns
URL is used to store settings, including selected columns
Selection stops working if you have too many columns, because URL is too long
Cannot change the order of columns in tables
Run overview always shows the latest metric value
Schema makes aggregation over metric tables slow
A separate table with the latest value per run is used as a workaround
Other aggregations could be done similarly, but it is difficult to add them and this doesn’t scale
Runs cannot have relationships to other runs
E.g., linking a fine-tuning run to the pre-training run it builds on
Graphs are too small
So much clicking, let us program our analyses already!
Our solution: just use SQL!
SQL is almost 50 years old and still relevant, so it has to have done something right
Mature ecosystem with great tools and tons of great resources to learn
A lot of people know it already
Fine-grained user privilege controls down to single tables
Experiment data is not actually that complex and is easy to map to relational DBs, especially with modern features like JSON columns
You know your experiments, so just define your metrics as columns and avoid mlflow’s performance problems (see the sketch after this list)
Build whichever analyses you like, display them wherever, e.g. as reports with notes in Jupyter Notebooks
Trivial conversion from SQL query results to a Pandas DataFrame
SQL + Pandas + Jupyter = insane flexibility FOR FREE
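To make this concrete, here is a minimal sketch of the idea. It is not SQLTrack’s actual schema or API: it uses an in-memory SQLite database and pandas so it runs as-is, and the table names, metric columns, and values are invented for illustration. A real setup would likely point at PostgreSQL or whatever database you prefer.

import sqlite3
import pandas as pd

# In practice this would be a file or a server connection;
# an in-memory SQLite DB keeps the sketch self-contained.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE experiments (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    comment TEXT
);
CREATE TABLE runs (
    id            INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiments(id),
    parent_run_id INTEGER REFERENCES runs(id), -- e.g. a fine-tuning run pointing at its pre-training run
    params        TEXT,                        -- JSON blob for hyperparameters
    val_loss      REAL,                        -- metrics are plain columns ...
    val_accuracy  REAL                         -- ... so aggregating them is plain SQL
);
""")

# Logging a run is just an INSERT, no magic involved.
con.execute("INSERT INTO experiments (id, name) VALUES (1, 'baseline')")
con.execute(
    "INSERT INTO runs (experiment_id, params, val_loss, val_accuracy) VALUES (?, ?, ?, ?)",
    (1, '{"lr": 3e-4, "batch_size": 64}', 0.42, 0.87),
)
con.commit()

# Analysis is one query away; pandas turns the result into a DataFrame,
# ready for plots or a written-up report in a Jupyter notebook.
df = pd.read_sql(
    """
    SELECT e.name, COUNT(*) AS num_runs, MAX(r.val_accuracy) AS best_accuracy
    FROM runs AS r JOIN experiments AS e ON e.id = r.experiment_id
    GROUP BY e.name
    """,
    con,
)
print(df)

From here the DataFrame can be plotted, styled, or dropped into a notebook report, and adding a new aggregation is just another query rather than a GUI feature request.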