What is the difference between validation set and test set for a model?
Which one's performance should we consider when judging the model?
The validation set is for adjusting a model's hyperparameters. The testing data set is the ultimate judge of model performance.
Testing data is what you hold out until the very last. You only run your model on it once. You don’t make any changes or adjustments to your model after that. Your score on the testing data is the final answer for your model performance. I’ve also heard this called evaluation data.
Training data is what you use to get your model parameters dialed in. If you’re doing linear regression, it’s the data you use to find your linear coefficients. If you’re building a neural network, it’s the data you use to find the weights connecting nodes between layers. You can train your model on some or all of your training data as many times as you want.
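To make "finding your linear coefficients" concrete, here is a minimal sketch of an ordinary least-squares fit of a line to a tiny training set. The function name fit_line and the four data points are made up for illustration; they are not from any particular library or course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

# The slope and intercept are the model parameters learned from training data.
slope, intercept = fit_line(train_x, train_y)
```

Because the toy data has no noise, the fit recovers a slope of 2 and an intercept of 1. With real data the coefficients would only approximate the underlying relationship.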
Some algorithms have knobs you can dial, settings you can adjust that affect how they perform. In linear regression, you can take measures to ensure that your coefficients don't grow out of control, such as lasso regression (L1 regularization) or ridge regression (L2 regularization). In these cases, there are constants you have to choose that control how aggressively regularization takes place. Constants like this are often called hyperparameters, especially when they occur in neural networks, like the number of layers or the learning rate. Hyperparameters are also parameters of the model, but they don’t get adjusted the same way the other parameters do. To find the best values for hyperparameters, you have to try out multiple combinations, re-fitting the rest of your model parameters each time. To keep your model honest, you need to have something like a testing data set that you can use over and over again to evaluate the model fit for each new set of hyperparameter values and see which combination gives the best performance. This is your validation set. It’s like a test data set in that you don’t use it to find your model parameters, but unlike a test data set in that you are allowed to use it over and over. I've also heard this called a tuning data set.
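Here is a rough sketch of how the three data sets divide the work, using a one-feature ridge regression with no intercept. Everything specific in it is an assumption made for illustration: the synthetic data, the 60/20/20 split, the candidate penalty values, and the helper names fit_ridge and mse.

```python
import random


def fit_ridge(xs, ys, lam):
    """Fit y ~ w * x with an L2 penalty lam * w**2 (closed form, one feature)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for this sketch): y = 2x plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Illustrative 60/20/20 split into training, validation, and testing data.
train_x, train_y = xs[:60], ys[:60]
val_x, val_y = xs[60:80], ys[60:80]
test_x, test_y = xs[80:], ys[80:]

# Try each hyperparameter value: fit on training data, score on validation data.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_scores = {}
for lam in candidates:
    w = fit_ridge(train_x, train_y, lam)
    val_scores[lam] = mse(w, val_x, val_y)

# The validation set picks the hyperparameter; it can be reused for every candidate.
best_lam = min(val_scores, key=val_scores.get)

# The test set is touched exactly once, after all choices are final.
final_w = fit_ridge(train_x, train_y, best_lam)
test_mse = mse(final_w, test_x, test_y)
```

The weight w is fit only on training data, the penalty lam is chosen only from validation scores, and test_mse is computed a single time at the end. Refitting on training plus validation data before the final test is a common variation, not shown here.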
As if that weren’t enough, an approach called cross validation makes it even more complex. You still start by setting aside a test data set that you will only touch once at the very end. The rest of your data you split into training and validation sets, fit your model parameters and evaluate your hyperparameter choices, then re-combine your training and validation data and split it again a different way, repeating the process many times. There are lots of different schemes for doing this with names like k-fold, leave-one-out, and Monte Carlo cross validation, but they all follow a similar pattern.
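The split, fit, re-combine, and split again pattern can be sketched with Monte Carlo cross validation, where each round draws a fresh random training/validation split. This is one possible arrangement, not a definitive recipe: the synthetic data, the 20 splits, the 25% validation fraction, and the helper names are all assumptions for illustration.

```python
import random


def fit_ridge(xs, ys, lam):
    """Fit y ~ w * x with an L2 penalty lam * w**2 (closed form, one feature)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def monte_carlo_cv(xs, ys, lam, n_splits=20, val_fraction=0.25, seed=0):
    """Average validation error of lam over repeated random train/validation splits."""
    rng = random.Random(seed)  # same seed gives every lam the same sequence of splits
    indices = list(range(len(xs)))
    n_val = int(len(xs) * val_fraction)
    scores = []
    for _ in range(n_splits):
        rng.shuffle(indices)  # re-combine and split again a different way
        val_idx, train_idx = indices[:n_val], indices[n_val:]
        w = fit_ridge([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        scores.append(mse(w, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)


# Synthetic data (an assumption for this sketch): y = 2x plus Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Set aside the test data first; cross validation never sees it.
dev_x, dev_y = xs[:80], ys[:80]
test_x, test_y = xs[80:], ys[80:]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: monte_carlo_cv(dev_x, dev_y, lam) for lam in candidates}
best_lam = min(cv_scores, key=cv_scores.get)

# Refit on all non-test data with the chosen lam, then use the test set once.
final_w = fit_ridge(dev_x, dev_y, best_lam)
test_mse = mse(final_w, test_x, test_y)
```

Swapping the random shuffle for a fixed partition into k blocks, each used once as the validation set, would turn this into k-fold cross validation. Setting k to the number of data points gives leave-one-out.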
If you would like to implement these concepts in all their glorious detail, I recommend enrolling in Course 213, Nonlinear Modeling and Optimization. It's a case study in which we step through the process of creating training, validation, and testing sets to choose between polynomial models by means of Monte Carlo cross validation.
Also, keep your eyes open for Course 313, Hyperparameter Tuning in Neural Networks, due out this winter. Neural networks have a menagerie of hyperparameters, and there are many schemes for choosing a good combination of them. In this course, you will get to code some up for yourself and see how they work. You'll get deep hands-on experience with training, validation, and testing data sets.