What is an ensemble method?
In real-world problems, we often reach a better conclusion when we ask for others' opinions and advice. The same idea applies to machine learning models: if we look at the results of several models and pick the best one, or decide by majority, we can usually reach a better conclusion.
A group of models combined to draw a conclusion together is called an "ensemble", so this whole process is called "ensemble learning".
In fact, most winning solutions in machine learning competitions use ensemble methods.
There are various kinds of ensemble methods; these are the four most popular ones:
- Voting method
- Bagging / Pasting
- Boosting
- Stacking
Voting method
There are two different voting methods:
- Hard voting: deciding by majority vote among the predictions of several machine learning models
  - The result of hard voting is often better than the best result of any single model
  - An ensemble of weak learners, models that are only slightly better than random guessing, can still be a strong learner
  - The assumption that ensemble learning performs better holds only if the models are sufficiently independent and the correlation between their errors is sufficiently low.
  - However, in practice the models will be somewhat dependent and their errors somewhat correlated, because we often have limited training data.
  - Therefore, to get the best performance from ensemble learning, we should focus on making the models as independent as possible.
  - One solution to this problem is to use a completely different algorithm for each model.
- Soft voting: if every model can output the probabilities of its predictions, we can average those probabilities and pick the class with the highest average.
  - Usually performs better than hard voting because it gives more weight to highly confident predictions (see the sketch after this list)
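Below is a minimal sketch of both voting types, assuming scikit-learn's VotingClassifier; the dataset and the three base models are placeholder choices for illustration, not the only possible ones.

```python
# A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier.
# The dataset and the three base models are arbitrary choices for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different algorithms, so their errors are less correlated.
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("svc", SVC(probability=True)),  # probability=True is required for soft voting
    ("tree", DecisionTreeClassifier(random_state=42)),
]

hard_clf = VotingClassifier(estimators=base_models, voting="hard")  # majority vote on class labels
soft_clf = VotingClassifier(estimators=base_models, voting="soft")  # average of predicted probabilities

for name, clf in (("hard", hard_clf), ("soft", soft_clf)):
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```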
Bagging / Pasting
Instead of using a completely different algorithm for each model to improve the ensemble's performance, we can use the same algorithm and introduce variation through the training data instead.
To do this, we randomly construct subsets of the training set. If we allow duplicates while sampling (sampling with replacement), we call it "bagging", which is short for "bootstrap aggregating". If we do not (sampling without replacement), we call it "pasting".
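A tiny sketch of the sampling difference, using NumPy and an assumed toy array of training indices:

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(10)  # pretend we have 10 training samples

# Bagging: sample WITH replacement, so the same index can appear several times
bagging_subset = rng.choice(indices, size=10, replace=True)

# Pasting: sample WITHOUT replacement, so every chosen index is unique
pasting_subset = rng.choice(indices, size=7, replace=False)

print(bagging_subset)  # may contain duplicates
print(pasting_subset)  # never contains duplicates
```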
In both methods the same training sample can be used by multiple models; however, only with bagging can the same training sample be drawn multiple times for a single model.
After we get each model's prediction, we usually aggregate with the statistical mode for classification problems and the average for regression problems.
Bagging usually performs a bit better than pasting because sampling with replacement adds more variety to each model's training set, which reduces the correlation between the models.
However, it is always better to check which method actually performs better by trying both, for example as sketched below.
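A minimal comparison of the two, assuming scikit-learn's BaggingClassifier; the base model, number of estimators, and sample fraction are placeholder choices.

```python
# Minimal sketch comparing bagging and pasting with scikit-learn's BaggingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# bootstrap=True  -> bagging: each model's subset is drawn WITH replacement
# bootstrap=False -> pasting: each model's subset is drawn WITHOUT replacement
for name, bootstrap in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=200,
        max_samples=0.8,
        bootstrap=bootstrap,
        random_state=42,
    )
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # classification -> majority vote (mode) of the trees
```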
Boosting
Boosting means connecting several weak models sequentially to build a stronger one. The key point of this method is that each new model is built by correcting the previous one.
There are several kinds of boosting methods, but these two are the most popular ones (a sketch of both follows this list):
- AdaBoost (Adaptive Boosting):
  - AdaBoost complements the previous model by increasing the weights of the training samples that the previous model underfit or misclassified.
  - With this method, each new model becomes progressively more focused on the samples that are difficult to learn.
  - The most popular boosting method
- Gradient Boosting:
  - Rather than adjusting sample weights at each iteration, the new model is trained on the residual errors produced by the previous predictions.
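A minimal sketch of both ideas, assuming scikit-learn for AdaBoost and a hand-rolled loop for the gradient-boosting residual idea; the data and hyperparameters are placeholders.

```python
# A minimal sketch of both boosting flavors; data and hyperparameters are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# --- AdaBoost: reweights misclassified samples at every iteration ---
X, y = make_classification(n_samples=500, random_state=0)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)
ada.fit(X, y)

# --- Gradient boosting idea: each new model fits the residuals of the previous ones ---
rng = np.random.default_rng(0)
X_reg = rng.uniform(-3, 3, size=(200, 1))
y_reg = np.sin(X_reg).ravel() + rng.normal(scale=0.1, size=200)

trees, residual = [], y_reg.copy()
for _ in range(3):                      # three boosting steps
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_reg, residual)           # fit whatever is still unexplained
    residual -= tree.predict(X_reg)     # update the remaining error
    trees.append(tree)

# The ensemble prediction is the sum of all the trees' predictions
y_pred = sum(tree.predict(X_reg) for tree in trees)
```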
Stacking (short for "stacked generalization")
The main idea of stacking is to train another model that combines the base models' predictions, instead of simply voting.
The model that combines the predictions of the base models and makes the final answer is called the "blender" or "meta learner".
The most common way to train the blender is to use a hold-out dataset; out-of-fold predictions are sometimes used as well.
- Blending process using a hold-out dataset (a code sketch follows this list):
  - Detailed process of the hold-out method:
    - First, we split the training set into two subsets (the hold-out method).
    - Then, we train three different prediction models on subset 1.
    - We can now make three different predictions on subset 2.
    - Next, we build a new dataset (the blending training set) from (1) the predicted results and (2) the target values, and use it to train the blender.
- An out-of-fold prediction is gathered from the validation fold of each split while conducting k-fold cross-validation; these predictions can be used to train the blender instead of a hold-out set.
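A minimal sketch of the hold-out blending steps above; the base models, the blender, and the 50/50 split are arbitrary choices.

```python
# A minimal sketch of blending with a hold-out set, following the steps above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Step 1: split the training set into two subsets (hold-out method)
X_sub1, X_sub2, y_sub1, y_sub2 = train_test_split(X, y, test_size=0.5, random_state=42)

# Step 2: train three different prediction models on subset 1
base_models = [LogisticRegression(max_iter=1000), SVC(), DecisionTreeClassifier(random_state=42)]
for model in base_models:
    model.fit(X_sub1, y_sub1)

# Step 3: predict on subset 2 and build the blending training set
blend_X = np.column_stack([model.predict(X_sub2) for model in base_models])

# Step 4: train the blender on (predicted results, target values)
blender = LogisticRegression()
blender.fit(blend_X, y_sub2)

# At inference time, new samples pass through the base models first, then the blender
def ensemble_predict(X_new):
    features = np.column_stack([model.predict(X_new) for model in base_models])
    return blender.predict(features)
```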
It is also possible to train several blenders at the same time, which creates a layered structure (a prediction layer followed by blending layers).
- To build such layers, we first divide the training dataset into three subsets (a rough sketch follows this list).
- Layer 1:
  - We use the training subsets to train three different prediction models.
- Layer 2:
  - We use the prediction results of the three prediction models to train three blender models.
  - Each prediction model's results are delivered to all three blender models.
- Layer 3:
  - We train a final blender that combines the outputs of the three blender models and makes the final prediction.
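A rough sketch of such a layered ensemble, assuming scikit-learn's StackingClassifier. Note that StackingClassifier trains each blender on out-of-fold predictions rather than on separate subsets, so this is an approximation of the subset-based procedure described above, not an exact reproduction.

```python
# A rough sketch of a layered stacking ensemble with scikit-learn's StackingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)

# Layer 1: three different prediction models
layer1 = [
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("svc", SVC()),
    ("rf", RandomForestClassifier(random_state=42)),
]

# Layer 2: three blenders, each combining all three layer-1 predictions
blender_a = StackingClassifier(estimators=layer1, final_estimator=LogisticRegression(), cv=5)
blender_b = StackingClassifier(estimators=layer1, final_estimator=DecisionTreeClassifier(), cv=5)
blender_c = StackingClassifier(estimators=layer1, final_estimator=RandomForestClassifier(random_state=42), cv=5)

# Layer 3: a final blender that combines the outputs of the layer-2 blenders
final_model = StackingClassifier(
    estimators=[("blender_a", blender_a), ("blender_b", blender_b), ("blender_c", blender_c)],
    final_estimator=LogisticRegression(),
    cv=5,
)
final_model.fit(X, y)
```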
Believe in yourself and all that you are. Know that there is something inside you that is greater than any obstacle.
- Max Holloway -