Writing my master's thesis in public
Choosing a thesis project, finding a supervisor and next steps
Some of you may know that I'll be finishing my Master's degree in data science soon. The only missing requirement before graduating is to write and defend my thesis 😅
But writing a thesis can be a long, overwhelming, and lonely process, so I wanted to try a new experiment!
Last week I posted on Twitter and LinkedIn that I'll be writing my thesis in public. This means I plan to share as much as I can about the behind-the-scenes of my thesis project. (This email is the first in the series.)
At the end of last month, I officially started thinking about potential thesis topics, asking professors for their opinions and seeing if they would be interested in supervising the project.
I also asked my employer (NannyML) if I could do a thesis project that somehow intersects with any of the company's current research questions. Together we discussed two ideas that interested me a lot:
Data drift detection for NLP.
Measuring temporal performance degradation in ML models and assessing whether performance estimation methods can help avoid it.
After a lot of thought, I went with option 2. Let me expand a bit about what it means and what I'm planning to do.
Last year, a paper published in Nature, Temporal quality degradation in AI models, showed how ML models' performance can degrade over time. In their analysis, the authors ran multiple experiments and found that 91% of their models suffered from temporal degradation.
To identify temporal model performance degradation, they designed a testing framework that emulates a typical production ML model and ran more than 20,000 dataset-model experiments with it.
For each experiment, they did four things:
Randomly select one year of historical data as training data.
Select an ML model (from linear regression, XGBoost, a Random Forest regressor, and a multilayer perceptron neural network).
Randomly pick a future datetime point at which to test the model.
Calculate the model's performance change.
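To make the four steps concrete, here's how a single experiment could be sketched end to end. Everything below is my own assumption for illustration (the synthetic dataset, the column names, the choice of candidate years, the holdout baseline); the paper runs this on real datasets, many thousands of times:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic daily dataset spanning 2015-2020 with gradual concept drift:
# the influence of feature f0 on the target slowly grows over time.
dates = pd.date_range("2015-01-01", "2020-12-31", freq="D")
drift = 0.001 * np.arange(len(dates))
X = rng.normal(size=(len(dates), 3))
y = X[:, 0] * (1 + drift) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=len(dates))
df = pd.DataFrame(X, columns=["f0", "f1", "f2"], index=dates)
df["y"] = y
features = ["f0", "f1", "f2"]

# 1. Randomly select one year of historical data as training data
#    (the last 20% of that year is held out to measure baseline performance).
train_year = int(rng.choice([2015, 2016, 2017]))
train = df[df.index.year == train_year]
split = int(len(train) * 0.8)
fit_part, holdout = train.iloc[:split], train.iloc[split:]

# 2. Select an ML model.
model = RandomForestRegressor(random_state=0)
model.fit(fit_part[features], fit_part["y"])

# 3. Randomly pick a future period on which to test the model (a whole year here).
future_year = int(rng.choice([2019, 2020]))
test = df[df.index.year == future_year]

# 4. Calculate the model's performance change (MSE on future data vs. baseline).
baseline_mse = mean_squared_error(holdout["y"], model.predict(holdout[features]))
future_mse = mean_squared_error(test["y"], model.predict(test[features]))
perf_change = future_mse - baseline_mse
print(f"Trained on {train_year}, tested on {future_year}: "
      f"MSE went from {baseline_mse:.3f} to {future_mse:.3f}")
```

Because the synthetic data drifts by construction, the future MSE comes out worse than the baseline, which is exactly the kind of temporal degradation the framework is designed to surface.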
I plan to replicate this testing framework, check whether I can reproduce similar results on other datasets, and go a step further by measuring how many of the degradation issues would have been flagged or avoided by the performance estimation methods (CBPE, DLE) provided in the NannyML library.
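What makes these methods interesting is that they estimate a model's performance without waiting for ground-truth labels. Here's a minimal sketch of the core idea behind CBPE (confidence-based performance estimation) for binary-classification accuracy, assuming well-calibrated predicted probabilities. This is my simplified illustration of the intuition, not NannyML's actual implementation, which also handles calibration, chunking, and other metrics:

```python
import numpy as np

def estimated_accuracy(y_pred_proba: np.ndarray) -> float:
    """Confidence-based estimate of accuracy for a binary classifier.

    Core idea: if the predicted probabilities are well calibrated, the
    chance that each thresholded prediction is correct equals the model's
    confidence in it, so averaging the confidences estimates accuracy,
    with no ground-truth labels needed.
    """
    confidence = np.maximum(y_pred_proba, 1 - y_pred_proba)
    return float(confidence.mean())

# Four predictions with varying confidence; the estimate is their mean
# confidence: (0.9 + 0.8 + 0.65 + 0.55) / 4, i.e. roughly 0.725.
proba = np.array([0.9, 0.8, 0.65, 0.55])
print(estimated_accuracy(proba))
```

Tracking this estimate over time on fresh production data is what lets you raise an alarm about degradation before the true labels ever arrive.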
What I've done so far
Find a thesis supervisor. I was lucky enough that the first professor I emailed with the thesis proposal found the topic super exciting and is willing to supervise it 😄
Start the search for datasets. Finding suitable temporal datasets is not an easy task: I've reviewed around 50 datasets this week, and I plan to continue the search for at least another week.
I want to use four datasets, each from a different industry (to avoid industry bias). Ideally, two will involve a regression task and the other two a classification task.
Read and re-read the main paper my research will be based on.
Speaking of this paper, I wrote a blog post explaining every aspect of the article in detail. It will be published in the coming weeks on NannyML's blog, so stay tuned.
Next steps
Decide on the four datasets.
Start the implementation of the testing framework (I'm planning to open-source this as well 🙌).
Initial EDA of at least one of the datasets.
Is there anything in particular you would like to know more about? Or something you would be interested in seeing in the next email?
Feel free to leave a comment with your thoughts!