ML - Cookie Cutter

Introduction

My first blog post was a beginner's guide to DS and ML that covered how to structure your files and folders while running experiments in Jupyter notebooks. In this article, I will talk about the next phase, i.e. structuring your entire ML project for production.

ML in Production

Once your model is ready, you need to create a folder structure to manage your source code, data, charts, hyperparameters, models, reports etc.

Wouldn't it be nice to have a custom script which gives you this skeleton - A cookie cutter, maybe?

Well, there is one called DS Cookie cutter. It is absolutely brilliant, but I wanted a simpler one that I could manage myself for better control.

Cookie-ML Project

There are two ways to use this script:

1 - Using make & Makefile
2 - Run .py script directly

Flexibility is the key!

Approach One - Using make & Makefile

If you have not heard of the make utility, it is very popular for automating the compilation of C/C++ programs and their dependencies. It is usually already available on Linux and macOS.

Prerequisites

The prerequisite is to have a virtual environment created. There are three ways of doing this:

  • conda - easier and works well, but only during the experimental phase of your project
  • virtualenv - was popular once and works well with pip
  • pipenv - is the most preferred method as it combines pip & virtualenv

Once you activate the virtual environment, just run the following command and it will create a folder structure named ml-project:

make build

To install ML libraries, update the requirements.txt and run

make install_libraries

If you want to see more options, use:

make

Approach Two - Python script

If you want to keep it simple, just use the command below and it will create the structure outside the current directory you are in.

Prerequisites

The only prerequisite is to have Python 3.x installed.
Optionally, if you have a virtual environment created, you can activate it first and then run:

python template.py folder-name

folder-name is a required parameter; just make sure you do not already have another folder with the same name. It is that simple!

Folder Structure

This section provides a quick reference on the structure.

tree-structure.png

Structure details

src - Most of your code lives in this folder e.g. main.py, preprocess.py, visualizer.py etc.

requirements.txt - If you use conda or virtualenv to install ML libraries, update this file. If you prefer pipenv, replace it with a Pipfile. Support for Pipfile is planned for the next release.

reports - All reports are stored here after data processing/cleaning. Store reports (xlsx, csv format) to be sent at regular intervals.

plots - Stores graphs (png/jpg). These can be used for presentations, published papers or project documentation.

params.yaml - YAML file for storing data and model configurations.

notebook - Stores all the Jupyter notebooks used during your experiment/research phase. The sub folders are:
nb_research - to store all your artifacts, notes, reference links.
nb_report - for storing all your sample reports.
nb_model - for storing your params and models during your experiment phase.
nb_data - contains all the data used during your research.
project_name_EDA_ML_Experiments.ipynb - the project_name part of the filename is the same as the project name.
project_name.ipynb - Final clean code lives here.

models - stores all the models you have trained with different hyperparameters.

logs - stores the logs, to be consumed by tools e.g. Prometheus/Grafana.
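As a rough illustration of how code in src could write into the logs folder, here is a minimal sketch using only Python's standard logging module. The function name get_file_logger, the logger name and the log line format are my own assumptions for this example; they are not part of the cookie cutter script, and whatever tool consumes the logs may need a different format.

```python
import logging
from pathlib import Path


def get_file_logger(log_dir: str = "logs", name: str = "ml-project") -> logging.Logger:
    """Return a logger that appends timestamped lines to <log_dir>/<name>.log.

    Hypothetical helper: the names and the format string are assumptions.
    """
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)  # create logs/ if it is missing

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # Attach the file handler only once, even if this function is called again.
    if not logger.handlers:
        handler = logging.FileHandler(log_path / f"{name}.log", encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling get_file_logger().info("training started") from the project root would append one line to logs/ml-project.log.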

docs - for building documentation - all your artifacts can be used for creating documentation.
You can use MkDocs or any static website generator. If you know React JS - try Docusaurus.

data - to store data. Following are the sub folders:
train - For training your model
testing - For storing unseen data
raw - original data
processed - cleaned/transformed data which will be split into train or test data
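To make the flow from processed to train and testing concrete, here is a small sketch that splits one CSV file using only the standard library. The function name split_processed_csv, the 80/20 default and the fixed seed are assumptions I picked for illustration; in a real project you would more likely use your ML library's own splitting utilities.

```python
import csv
import random
from pathlib import Path


def split_processed_csv(
    data_dir: str,
    filename: str,
    test_fraction: float = 0.2,
    seed: int = 42,
) -> tuple:
    """Split <data_dir>/processed/<filename> into the train and testing folders.

    Hypothetical helper. Returns the number of data rows written as
    (train_rows, testing_rows); the header row is copied to both files.
    """
    root = Path(data_dir)
    with open(root / "processed" / filename, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    parts = {"testing": rows[:n_test], "train": rows[n_test:]}

    for folder, part in parts.items():
        out_dir = root / folder
        out_dir.mkdir(parents=True, exist_ok=True)
        with open(out_dir / filename, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(part)

    return len(parts["train"]), len(parts["testing"])
```

For example, split_processed_csv("data", "my_dataset.csv") would read data/processed/my_dataset.csv (a made-up filename) and write the two parts to data/train and data/testing.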

Readme.md - Landing page of your project which is a markdown file.
Makefile - To automate your script and its dependencies.
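If you are curious how a skeleton like this can be generated, the sketch below builds the folders and empty files listed above with pathlib. It is only an approximation written for this article: the function name create_skeleton is hypothetical, it creates the project relative to wherever you run it, it leaves out the two notebook files, and the actual template.py may well do things differently.

```python
from pathlib import Path

# Folder and file names taken from the structure described above.
DIRECTORIES = [
    "src",
    "reports",
    "plots",
    "notebook/nb_research",
    "notebook/nb_report",
    "notebook/nb_model",
    "notebook/nb_data",
    "models",
    "logs",
    "docs",
    "data/train",
    "data/testing",
    "data/raw",
    "data/processed",
]
FILES = ["requirements.txt", "params.yaml", "Readme.md", "Makefile"]


def create_skeleton(folder_name: str) -> Path:
    """Create the project skeleton and return its root path.

    Hypothetical helper. Raises FileExistsError if a folder with that
    name already exists, so an existing project is never overwritten.
    """
    root = Path(folder_name)
    root.mkdir(parents=True, exist_ok=False)

    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True)
    for name in FILES:
        (root / name).touch()  # empty placeholder file
    return root
```

Calling create_skeleton("ml-project") gives you the empty tree, and calling it a second time with the same name fails instead of touching the existing folder, which matches the "no folder with the same name" rule mentioned earlier.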

The complete code is available here at GitHub.

Future Support

  • Virtual environment
  • DVC (Data Version Control)
  • Docker
  • MkDocs & Docusaurus

Conclusion

Stephen Hawking once said:

One of the basic rules of the universe is that nothing is perfect.

So is my script, but it has worked well for me and it gets the job done!

If you have any questions or feedback, do reach out to me on Twitter or LinkedIn.

Happy Learning!!