Applied ML is highly iterative: choose hyperparameters, code, evaluate, and iterate. Quick prototyping is therefore key to the success of an ML project
Bias is an indicator of training set performance. If bias is high, the model is underfitting the training set and there is scope to learn more from the training data
Variance is an indicator of dev/test set performance. If variance is high, the model is overfitting the training set and cannot generalize well to data it has not seen
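A concrete illustration of the two failure modes, as a minimal NumPy sketch: polynomial fits of increasing degree to noisy samples of a non-linear target. The target function sin(3x), the noise level, and the degrees are all arbitrary choices for demonstration, not anything prescribed by the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a non-linear target function (arbitrary choice).
f = lambda x: np.sin(3 * x)
x_train = np.linspace(-1, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.05, x_train.size)
x_test = rng.uniform(-1, 1, 30)
y_test = f(x_test) + rng.normal(0, 0.05, x_test.size)

def errors(degree):
    """Train/test mean squared error of a least-squares polynomial fit."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 1: high bias (underfits: both errors are large).
# Degree 5: reasonable fit (both errors near the noise floor).
# Degree 15: high variance (train error tiny, test error noticeably larger).
for degree in (1, 5, 15):
    train_mse, test_mse = errors(degree)
    print(f"degree {degree:2d}: train {train_mse:.4f}, test {test_mse:.4f}")
```

The high-bias model is bad on both sets; the high-variance model looks great on the training set but degrades on held-out data, which is exactly the signature used below.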
Image 1: Bias and variance visualized for 2-dimensional data
Image 2: High bias and high variance visualized for 2-dimensional data. The classifier is mostly linear, with some highly complex non-linear elements
Visualizing bias and variance beyond 2 dimensions is difficult, so in higher dimensions we instead compare Train Set Error and Dev Set Error
Image 3: Example of low/high bias and variance values based on Train and Dev set errors. Optimal Bayes Error is ~0%
Optimal or Bayes error: the error achieved by an optimal classifier; think of it as a benchmark value for the error. In the example above, we assume the optimal error is nearly 0%, since humans can identify a cat image with almost no error. The optimal error could instead be as high as 14% if the images were blurred to the point where neither machines nor humans could determine the right class of the object
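The table above can be turned into a small rule of thumb. A minimal sketch, assuming a simple fixed threshold (2 percentage points here, an arbitrary illustrative choice) for "much larger than":

```python
def diagnose(train_error, dev_error, bayes_error=0.0, gap=0.02):
    """Label bias/variance from train and dev set errors (fractions in [0, 1]).

    The fixed `gap` threshold is an assumption for illustration,
    not a universal constant.
    """
    labels = []
    if train_error - bayes_error > gap:
        labels.append("high bias")       # far from Bayes error: underfitting
    if dev_error - train_error > gap:
        labels.append("high variance")   # train -> dev jump: overfitting
    return labels or ["low bias, low variance"]

# The four cases from the table above (Bayes error assumed ~0%):
print(diagnose(0.01, 0.11))   # -> ['high variance']
print(diagnose(0.15, 0.16))   # -> ['high bias']
print(diagnose(0.15, 0.30))   # -> ['high bias', 'high variance']
print(diagnose(0.005, 0.01))  # -> ['low bias, low variance']
```

Note that both comparisons are relative to Bayes error: if Bayes error were 14%, a 15% train error would no longer signal high bias.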
Bias and Variance Trade-off? Not Really in DL. In the deep learning era, a bigger network or longer training mainly reduces bias, while more data or regularization mainly reduces variance, so we can usually attack one without hurting the other and the classic trade-off matters less
Image 4: Flow chart describing the steps to tackle an ML problem. Try to reduce high bias first, then try to reduce high variance