Machine learning pre-launch checklist

If this hasn't happened to you yet, you're wiser than I will ever be: I launch a big training run, and come back hours, days, or even weeks later to find useless results. What could I have done? This is a list of things to do right before firing off your machine learning algorithm. Think of it as a pre-launch checklist. The executive summary is, "What basic facts do I know about this problem that the algorithm doesn't know?"
  1. Set bounds on "reasonable" inputs. You probably have an idea of what reasonable data is - and your training set may already have violated it. For example, check for datapoints whose features are very similar but whose labels are very different. To state the obvious, the optimizer will not be able to tell that these datapoints are mistakes. If they're really wild, it will think they're the most important part, and it will steamroll the rest of the training while trying to fit the impossible. (A minimal check is sketched after this list.)
  2. Set bounds on "reasonable" outputs. There are always some bounds on the outputs. If you're like me, you may only think of these bounds once you read your outputs and begin to curse fate. So, try to imagine that moment now, and add some bounds. The simplest way is to rewind the run, or just kill the optimizer, once they're exceeded (see the guard sketched after this list).
  3. Limit sensitivity to small changes in input. Few real things you want to model are highly sensitive to small changes in input. Sensitivity is closely related to "model capacity": you don't want your model to be capable of far higher variability than the real thing being modeled. Check your regularization, and estimate the maximum sensitivity of your model, given your "reasonable" assumptions about your dataset (which you confirmed in steps 1 and 2). Consider data augmentation with pseudo-data: copies of your existing data with identical labels but slight variation in features (sketched after this list). Even if regularization fails, this tells the optimizer not to take small differences too seriously.
  4. Check what's missing from the training set. Think about all plausible inputs. The world is broad and wide! Compare that to your training set, which probably isn't. If (when) you only have partial coverage, consider data augmentation with some dummy points that at least don't violate your common sense (a crude coverage check is sketched after this list). Make sure they don't conflict with the real data. They may not be right, but they're less wrong than what your optimizer would come up with, since it has no common sense at all.
  5. Estimate the relative importance of the datapoints. Some data is less likely to appear in practice. Some has high uncertainty. Some is that dummy data I've been telling you to add, which has REALLY high uncertainty because you made it up. These points are useful, but they have the power to distort your results if taken too seriously - and the optimizer will absolutely take them seriously. You have to tell it not to do that. Downweight them in your loss function, with a weight between 0.001 and 1.0 times that of the trusted points (see the last sketch after this list).
  6. Try a quick internet search... For example, searching just the name of this checklist, I find machinelearningmastery.com/machine-learning-checklist and medium.com/@subhojit20_27731/ml-project-checklist, both filled with good advice! These lists aren't quite last-minute checklists; they're more like full walkthroughs of how to solve a problem. For people like me, who start building immediately and start thinking later, a last-minute checklist is more apt. But even I sometimes check the first page of search results, and it mostly turns out to be a good thing. The internet has a lot of smart people, and they've been doing this for a while. "What has been will be again, what has been done will be done again; there is nothing new under the sun." --Ecclesiastes, c. 450-200 BCE
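Here are a few minimal sketches for items 1 through 5, in Python. Every function name, threshold, and dataset below is illustrative - a sketch of the idea under stated assumptions, not a fixed recipe. First, item 1: a brute-force scan for pairs of points whose features are nearly identical but whose labels are wildly different. The tolerances are guesses you would tune to your problem, and for big datasets you would swap the double loop for a nearest-neighbor index.

```python
import numpy as np

def find_label_conflicts(X, y, feature_tol=1e-3, label_tol=1.0):
    """Flag pairs of points whose features are almost identical but whose
    labels differ by more than label_tol. Both tolerances are problem-specific
    guesses. Brute force, O(n^2); fine for a sanity check on modest datasets."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    conflicts = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if (np.linalg.norm(X[i] - X[j]) < feature_tol
                    and abs(y[i] - y[j]) > label_tol):
                conflicts.append((i, j))
    return conflicts

# Two nearly identical inputs with very different labels: a likely mistake.
X = [[0.10, 1.0], [0.10, 1.0001], [5.0, 2.0]]
y = [3.0, 42.0, 3.1]
print(find_label_conflicts(X, y))   # -> [(0, 1)]
```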
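For item 2, a sketch of an output-bounds guard. It assumes a scikit-learn-style predict() and hypothetical bounds; you would call it every so often during training and either kill the run, as here, or rewind to the last good checkpoint.

```python
import numpy as np

# Hypothetical bounds from domain knowledge: say the quantity can't be
# negative, and anything above 1e6 means something has gone badly wrong.
OUTPUT_MIN, OUTPUT_MAX = 0.0, 1e6

def assert_sane_outputs(model, X_check, step):
    """Call this every N training steps; it stops the run as soon as
    predictions leave the range you decided was reasonable."""
    preds = model.predict(X_check)      # assumes a scikit-learn-style API
    if np.any(preds < OUTPUT_MIN) or np.any(preds > OUTPUT_MAX):
        raise RuntimeError(
            f"step {step}: predictions outside [{OUTPUT_MIN}, {OUTPUT_MAX}] "
            f"(min={preds.min():.3g}, max={preds.max():.3g})"
        )
```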
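For item 3, pseudo-data augmentation: noisy copies of existing points with identical labels. The noise scale is an assumption about what counts as a negligible change in your features.

```python
import numpy as np

def jitter_augment(X, y, copies=3, scale=0.01, seed=0):
    """Add `copies` perturbed duplicates of each point: slightly jittered
    features, identical labels. `scale` is relative to each feature's
    standard deviation -- pick it from what "negligible" means for your data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    noise_sd = scale * X.std(axis=0)
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        X_aug.append(X + rng.normal(0.0, noise_sd, size=X.shape))
        y_aug.append(y)                  # labels unchanged on purpose
    return np.concatenate(X_aug), np.concatenate(y_aug)
```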
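For item 4, a crude coverage check against a plausible input range (the ranges here are made up). Real coverage is a joint property of all features, but even a per-feature histogram catches embarrassing holes and tells you where dummy points might be worth adding.

```python
import numpy as np

# Plausible range for each feature, from domain knowledge (made-up numbers).
PLAUSIBLE_LO = np.array([0.0, -10.0])
PLAUSIBLE_HI = np.array([100.0, 10.0])

def coverage_gaps(X, bins=10):
    """Return, per feature, the sub-ranges of the plausible interval that
    contain no training data at all."""
    X = np.asarray(X, dtype=float)
    gaps = {}
    for f in range(X.shape[1]):
        edges = np.linspace(PLAUSIBLE_LO[f], PLAUSIBLE_HI[f], bins + 1)
        counts, _ = np.histogram(X[:, f], bins=edges)
        empty = [(edges[i], edges[i + 1]) for i in range(bins) if counts[i] == 0]
        if empty:
            gaps[f] = empty
    return gaps

# For each empty region you could add a handful of dummy points with
# common-sense labels -- and then downweight them, as in item 5's sketch.
```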
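Finally, item 5: downweighting the made-up points in the loss. Most scikit-learn estimators accept per-sample weights; Ridge here is just a stand-in, and the data, the `is_dummy` mask, and the 0.01 weight are all illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Suppose X, y already mix real and dummy/augmented points, and `is_dummy`
# marks the made-up ones (everything here is synthetic for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1
is_dummy = np.zeros(len(X), dtype=bool)
is_dummy[-20:] = True                   # pretend the last 20 are dummies

weights = np.ones(len(X))
weights[is_dummy] = 0.01                # somewhere in the 0.001-1.0 range

# The estimator's loss now takes the dummy points far less seriously.
model = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
```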
All items checked? Great. Go make that machine learn!