Overfitting and Underfitting
Before we talk about what cross validation is, we should first think about overfitting and underfitting in machine learning.
In typical ML problems, we use explanatory variables from the training data to predict the target variable in the test data. However, the test data does not come with the target variable, which makes it difficult to validate the model's performance. Reusing the training data for validation is not best practice either, because that data was already used to fit the model.
To make the model generalize, meaning it performs well on data beyond the training set, we should apply dedicated validation methods such as cross validation. This helps prevent the model from overfitting or underfitting.
- Overfitting: the model fits the training data so closely that it fails to generalize; it performs well only on the specific data it was trained on.
- Underfitting: the model is not trained enough to capture the underlying pattern, so it performs poorly even on the training data.
There are various methods for preventing the model from overfitting:
- Hold out
- Cross Validation
- Leave-one-out
The key point of these methods is to separate validation data from the training data, so the model is evaluated on data it has not seen during training.
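For example, a quick way to see overfitting in practice is to compare training and validation scores. Below is a minimal sketch, assuming scikit-learn, a synthetic dataset, and an unconstrained decision tree (all illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree tends to memorize the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# A large gap between these two scores is the classic sign of overfitting
print("train score:     ", model.score(X_train, y_train))  # usually ~1.0
print("validation score:", model.score(X_val, y_val))      # noticeably lower
```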
You can find more about overfitting and underfitting in my other post:
Hold out
The hold-out method simply divides the original training data into (1) training data and (2) validation data in a specific ratio (8:2, for example).
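As a minimal sketch (assuming scikit-learn and a synthetic placeholder dataset; the 8:2 ratio matches the example above), the hold-out split can be done with `train_test_split`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# Hold out 20% of the original training data as a fixed validation set (8:2)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_val))  # 800 200
```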
One thing we should be aware of while using the hold-out method is that the model can easily overfit if we tune it too aggressively against the validation score, because the validation data is fixed.
We can use cross validation to avoid this problem.
Cross Validation
Cross validation (1) divides the whole training data into k subsets (folds), (2) treats each subset in turn as the validation set while the remaining k-1 subsets are used to train the model, and (3) repeats this process k times.
Compared with the hold-out method, this produces a more generalized model because it is validated on k different subsets rather than a single fixed one.
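Here is a minimal sketch of k-fold cross validation with scikit-learn's `cross_val_score` (the logistic regression model and k=5 are arbitrary, illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)  # placeholder data

# Each of the 5 folds serves as the validation set exactly once
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold scores:", scores)
print("mean score: ", scores.mean())
```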
- When the amount of data is small, it's better to use cross validation instead of setting aside a fixed validation set.
- Generally, with around 100,000 data points, split them in an 8:1:1 ratio (train:validation:test).
- For datasets with over 1 million samples, a ratio of approximately 98:1:1 is sufficient.
- Generally, once you can secure more than 10,000 samples each for the validation and test sets, it's better to allocate the remaining samples to the training set.
There are several kinds of cross validation, but these are the two most popular ones:
- standard k-fold cross validation: the most basic one, as explained above
- k is usually set to 5 or 10
- pro: all data is used for both training AND validation -> better generalization
- con: the same training process must be run k times
- stratified k-fold cross validation: an enhanced version of standard k-fold that preserves the class proportions in both the training and validation sets (see the sketch after this list)
- maintains the same class distribution in each fold as in the complete dataset
- if the overall dataset has a positive:negative ratio of 3:7, each fold maintains the same ratio
- works well with imbalanced datasets (e.g., medical diagnosis, fraud detection: datasets with minority classes)
- usually used for binary or multi-class classification problems
- pro: prevents biased evaluation caused by uneven class splits
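A minimal sketch of stratified k-fold, assuming scikit-learn's `StratifiedKFold` and a synthetic imbalanced dataset (the class weights below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced placeholder data: roughly 70% negative, 30% positive
X, y = make_classification(n_samples=1000, weights=[0.7], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each fold keeps roughly the same positive ratio as the full dataset
    print(f"fold {fold}: positive ratio = {y[val_idx].mean():.2f}")
```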
Leave-one-out
The leave-one-out method is similar to k-fold cross validation, but each single sample of the training dataset becomes the validation set in turn (i.e., k equals the number of samples).
In other words, the model must be trained once per sample. Therefore, this method should only be used when the training data is not that large.
- pro: a less biased estimate -> better generalization
- con: even more computation than k-fold cross validation
There is also the leave-p-out method, which holds out p samples (instead of one) as the validation set in each iteration.
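A minimal sketch of leave-one-out with scikit-learn's `LeaveOneOut` (the small synthetic dataset is deliberate, since the model is refit once per sample; `LeavePOut` works analogously):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Keep the dataset small on purpose: LOO fits the model once per sample
X, y = make_classification(n_samples=100, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("number of fits:", len(scores))       # 100, one per sample
print("LOO accuracy:  ", scores.mean())
```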