
Why do we split the data into train and test sets?

Imagine you’re teaching a few students. You’ve spent the last 3 weeks showing them how to solve quadratic equations. You’ve announced a test for next week and released a sample test containing 20 questions.

The students will apply one of the two available strategies:

  1. Cram the sample test and hope the actual exam’s the same
  2. Use the sample test to gauge their level of understanding, then go back to the source material to fill in the gaps

Clearly, strategy 1 is the path of least resistance, but strategy 2 is what will help a student truly master the course material. As a teacher, you have 2 strategies available to evaluate your students:

  1. Make the actual test a 1:1 copy of the sample test
  2. Design a different test that uses the same concepts as the sample test

Using strategy 1 won’t give you a good measure of how well the students have understood the content; you’ll only be rewarding the students who memorized the sample test. Deploying the second strategy, however, will give you a good measure of how much the students have actually learned.

Figure: Splitting a dataset into 2 pieces, a training set and a test set.

Continuing the analogy from above, we follow a similar strategy while building and evaluating machine learning models. We take a portion of the data to train our model (the student), known as the “train set”. The remaining portion, the “test set”, is held back and used only to evaluate the model. If we evaluated the model on the same data it was trained on, it would appear to perform remarkably well, but only because it has had the chance to memorize those exact examples. This is “overfitting”: the model has memorized the data and cannot generalize its “learnings” to unseen data points, and testing on the training data hides the problem instead of exposing it.
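As a minimal sketch of this idea (assuming scikit-learn is available; the toy dataset and decision-tree model below are illustrative choices, not something prescribed here), this is how the split is typically done and why the test score is the number to trust:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small toy dataset (illustrative choice).
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data as the test set (the common 80-20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained decision tree tends to memorize its training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on the training data looks nearly perfect...
print("Train accuracy:", model.score(X_train, y_train))

# ...while accuracy on the held-out test data is the honest measure
# of how well the model generalizes to unseen points.
print("Test accuracy:", model.score(X_test, y_test))
```

The train accuracy will typically come out near 1.0 while the test accuracy is noticeably lower; only the latter tells you how the model is likely to behave on new data.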

Summary

  • Evaluating the model on the same data it was trained on gives an overly optimistic score and hides overfitting, leading to poor performance in real-life scenarios.
  • To avoid this, split your data into 2 pieces: a train set and a test set. The most common practice is an 80-20 split.