
[Kaggle Study] 14. Hyperparameter Tuning

dongsunseng 2024. 11. 16. 23:36

Hyperparameter Tuning?

  • The process of adjusting hyperparameters to optimize the model for better performance
  • There are certain hyperparameters that have a higher tuning priority than others.
    • In other words, some hyperparameters can work well with commonly known values without tuning, while others require tuning to determine which values work best.
  • There are various tuning methods such as:
    • Manual Search
    • Grid Search
    • Random Search
    • Bayesian Optimization
    • Non-Probabilistic
    • Evolutionary Optimization
    • Gradient-based Optimization
    • Ensemble-based Optimization
    • Early Stopping
    • and more
  • All methods except Manual Search are called 'Automated Hyperparameter Selection'

  • This image indicates that trying random values is a much better way to test hyperparameters than using a grid.
    • When $\alpha$ is a hyperparameter that requires a lot of tuning, the grid method only lets us try a fixed number of distinct values for it.
    • With random sampling, however, we can try a much wider variety of values.

  • 'Coarse to fine' search is the process of narrowing down to a smaller range (the square in the image) once we find several values that work well, and then searching more densely within that range (a short sketch of both ideas follows this list).
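  • Below is a minimal sketch (not from the original post) of these two ideas, assuming the hyperparameter being tuned is a learning rate sampled on a log scale: a grid only ever tries a handful of distinct values, random sampling tries a new value every time, and coarse-to-fine simply re-samples within the best-performing range.
import numpy as np

rng = np.random.default_rng(42)

# Grid: 5 fixed values per hyperparameter -> a 5x5 grid spends 25 trials
# but only ever tests 5 distinct learning rates.
grid_lr = np.linspace(1e-4, 1e-1, 5)

# Random: the same 25-trial budget tests 25 distinct learning rates.
# Sampling the exponent spreads the draws evenly on a log scale.
random_lr = 10 ** rng.uniform(-4, -1, size=25)

# Coarse to fine: if trials around 1e-3 ~ 1e-2 scored best, narrow the
# range and sample more densely there.
fine_lr = 10 ** rng.uniform(np.log10(1e-3), np.log10(1e-2), size=25)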

#1 Manual Search

  • Also called 'rules of thumb', which refers to setting hyperparameter values based on experience or intuition.
  • In reality, it is convenient to follow commonly known values since they usually perform well and only require minor code adjustments (see the short sketch below).
  • However, the downside is that it's difficult to compare performance across different hyperparameter combinations.
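  • As a concrete (hypothetical) example, manual search is often nothing more than hard-coding commonly used values and tweaking them by hand between runs:
from sklearn.ensemble import RandomForestClassifier

# Manual search: start from commonly known values and adjust by hand,
# re-training and comparing scores after each change.
clf = RandomForestClassifier(
    n_estimators=100,    # a common starting point
    max_depth=None,      # let trees grow fully; constrain later if overfitting
    min_samples_leaf=1,
    random_state=0,
)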

#2 Grid Search

  • Most basic hyperparameter optimization method.
  • Grid Search trains the model with every possible hyperparameter combination to find the optimal one, which is called 'exhaustive searching'.
  • Some hyperparameters have no predefined range, so users may need to set boundaries or specific candidate values for them.
  • The disadvantage is that it takes a very long time when there are many hyperparameters or large datasets, since it examines every possibility.

  • The hyperparameters being searched here are x1 and x2.
  • Within the range (0, 1), using 10 values per hyperparameter gives 100 possible combinations.
  • True to its name "grid search," the combinations are arranged in coordinates.
  • The blue contour lines represent combinations with high performance, while the red contour lines represent combinations with low performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Search space: every combination of these values will be evaluated
grid_search = {'criterion': ['entropy', 'gini'],
               'max_depth': [2],
               'max_features': ['auto', 'sqrt'],
               'min_samples_leaf': [4, 6, 8],
               'min_samples_split': [5, 7, 10],
               'n_estimators': [20]}

clf = RandomForestClassifier()
grid = GridSearchCV(estimator=clf, param_grid=grid_search,
                    cv=4, verbose=5, n_jobs=-1)
grid.fit(X_Train, Y_Train)

grid_pf = grid.best_estimator_.predict(X_Test)
grid_acc = accuracy_score(Y_Test, grid_pf)

print(grid.best_params_)
print(grid_acc)

# result:
# {'criterion': 'gini',
#  'max_depth': 2,
#  'max_features': 'auto',
#  'min_samples_leaf': 4,
#  'min_samples_split': 5,
#  'n_estimators': 20}

# 0.9992392589211521
  • The variable 'grid_search' is used to set the search range for grid search.
  • Other methods besides grid search also require defining their search ranges.

#3 Random Search

  • Random Search is a method that finds the optimal combination by randomly sampling combinations within boundaries.
  • It's said to offer better performance relative to the time spent compared to Grid Search.
  • It particularly produces good results when a small number of hyperparameters influence model performance.
  • However, just like Grid Search, it's inefficient because it searches through a wide range to find the optimal hyperparameters.
  • This inefficiency led to the emergence of automated hyperparameter tuning methods such as Bayesian optimization.

  • While similar to the Grid Search diagram, the difference is that Random Search randomly samples 100 combinations within an unspecified range.
  • As a result, the x values are scattered sporadically.
  • The green lines on the edges of the graph indicate selected values, and you can see that many values were sampled near 0.4 for x1 and near 0.6 for x2.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score

# Search space: n_iter combinations are sampled at random from these values
random_search = {'criterion': ['entropy', 'gini'],
                 'max_depth': [2],
                 'max_features': ['auto', 'sqrt'],
                 'min_samples_leaf': [4, 6, 8],
                 'min_samples_split': [5, 7, 10],
                 'n_estimators': [20]}

clf = RandomForestClassifier()
random = RandomizedSearchCV(estimator=clf, param_distributions=random_search, n_iter=10,
                            cv=4, verbose=1, random_state=101, n_jobs=-1)
random.fit(X_Train, Y_Train)

random_pf = random.best_estimator_.predict(X_Test)
random_acc = accuracy_score(Y_Test, random_pf)

print(random.best_params_)
print(random_acc)

# Result:
# {'n_estimators': 20,
#  'min_samples_split': 10,
#  'min_samples_leaf': 4,
#  'max_features': 'auto',
#  'max_depth': 2,
#  'criterion': 'gini'}

# 0.9992509626300574

#4 Bayesian Optimization (HyperOpt)

  • Bayesian optimization is a method for finding the optimal solution that maximizes or minimizes an objective function.
  • To understand Bayesian optimization, it's important to understand these concepts:
    • Surrogate model: A model that makes probabilistic estimates of the objective function. This is our estimated objective function.
    • Acquisition function: A function that recommends the next input data (hyperparameter combination) based on the probabilistic estimates from the Surrogate model.
  • Here, the objective function refers to metrics we want to calculate, such as loss or accuracy.
  • Bayesian optimization builds a surrogate model from the hyperparameter combinations evaluated so far and their objective values, evaluates it, and uses the Acquisition Function to recommend the combination to try next.
  • This process is repeated, sequentially updating to find the optimal combination.
  • Bayesian optimization algorithms utilize prior distributions to find parameters that maximize or minimize unknown objective functions.
    • Because it exploits this prior, Bayesian optimization can find parameters that yield better performance than grid search or random search, which don't use prior information about the model.
    • However, Bayesian optimization doesn't show good performance for all problems and isn't a generalized algorithm for solving extremely complex problems.
    • It's a methodology primarily used in ML rather than DL, and is optimized for moderate sizes of hyperparameters and data.
    • BO (Bayesian Optimization) uses machine learning when searching for parameters - it's essentially ML for ML.
  • Generally used when the objective function contains noise.
  • Primarily used in cases where objective function calculations are not large or time-consuming (reason why it's not commonly used in DL).
    • It appropriately weighs the time needed to evaluate the objective function against the expected improvement from further evaluations.

  • The figure above illustrates the Bayesian optimization process.
  • The blue band represents the uncertainty range (error margin), the points are the observations at each step, the dotted line is the true objective function, the solid line is the surrogate model, the green curve is the acquisition function, and the red triangle marks the maximum of the acquisition function, which becomes the next observation point.
  • At t=2, the observation showed a maximum value at the red triangle point.
  • This area becomes the observation point for the next time step, t=3.
  • The disappearance of the blue area around this point indicates that the uncertainty there has been eliminated.
  • With the emergence of a new maximum value, the next observation point shifts.
  • By repeating this process, we optimize the hyperparameter combination (a toy sketch of this loop follows below).
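  • The sketch below (not from the original post) mimics this loop on a one-dimensional toy problem: a Gaussian process plays the role of the surrogate model and a simple upper-confidence-bound rule plays the role of the acquisition function; real libraries such as HyperOpt use more sophisticated choices (e.g. TPE).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Toy stand-in for "validation score as a function of one hyperparameter";
    # in practice this would train and evaluate a model.
    return -(x - 0.7) ** 2 + 0.1 * np.sin(5 * x)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)

# Start from a few random observations.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, size=(3, 1))
y_obs = np.array([objective(x[0]) for x in X_obs])

for _ in range(10):
    # Surrogate model: probabilistic estimate of the objective.
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Acquisition function (upper confidence bound): balance the surrogate's
    # mean prediction against its uncertainty, then pick its maximum.
    ucb = mu + 2.0 * sigma
    x_next = candidates[np.argmax(ucb)]

    # Evaluate the true objective at the recommended point and update.
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next[0]))

print("best x:", X_obs[np.argmax(y_obs)][0], "best value:", y_obs.max())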

HyperOpt

  • HyperOpt is a framework based on Bayesian optimization modeling.
    • There are other frameworks that provide Bayesian optimization features, such as the BayesianOptimization library.
  • An automated hyperparameter tuning framework, and its fmin() function contains three parameters:
    • Objective Function: The loss function to minimize
    • Domain Space: The search space. In Bayesian optimization, this space generates statistical distributions for each hyperparameter.
    • Optimization Algorithm: The algorithm used to find the optimal combination
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# space: defines the range the algorithm will search
# hp.choice: picks one of the values in the given list
# hp.uniform: draws a random number within the given range
# hp.quniform: draws values within the given range, spaced by the last argument (q)
space = {'criterion': hp.choice('criterion', ['entropy', 'gini']),
        'max_depth': hp.quniform('max_depth', 10, 12, 10),
        'max_features': hp.choice('max_features', ['auto', 'sqrt', 'log2', None]),
        'min_samples_leaf': hp.uniform('min_samples_leaf', 0, 0.5),
        'min_samples_split': hp.uniform('min_samples_split', 0, 1),
        'n_estimators': hp.choice('n_estimators', [10, 50])
    }

def objective(space):
    hopt = RandomForestClassifier(criterion = space['criterion'],
                                  max_depth = int(space['max_depth']),  # quniform returns a float
                                  max_features = space['max_features'],
                                  min_samples_leaf = space['min_samples_leaf'],
                                  min_samples_split = space['min_samples_split'],
                                  n_estimators = space['n_estimators'],
                                  )
    
    accuracy = cross_val_score(hopt, X_Train, Y_Train, cv = 4).mean()

    # We aim to maximize accuracy, therefore we return it as a negative value
    return {'loss': -accuracy, 'status': STATUS_OK }
    
trials = Trials()
best = fmin(fn= objective,
            space= space,
            algo= tpe.suggest,
            max_evals = 20,
            trials= trials)

# Best hyperparameter combination
best

''' Result: {'criterion': 1,
'max_depth': 10.0,
'max_features': 3,
'min_samples_leaf': 0.017342284266195274,
'min_samples_split': 0.2509860841981386,
'n_estimators': 1} '''

# Now that 'best' holds the optimal combination, we train a model with it.
# fmin returns indices for hp.choice parameters, so we build lookup dictionaries
# to map those indices back to the actual values; hp.uniform and hp.quniform
# parameters are returned as numbers directly.
crit = {0: 'entropy', 1: 'gini'}
feat = {0: 'auto', 1: 'sqrt', 2: 'log2', 3: None}
est = {0: 10, 1: 50}

trainedforest = RandomForestClassifier(criterion = crit[best['criterion']],
                                       max_depth = int(best['max_depth']),
                                       max_features = feat[best['max_features']],
                                       min_samples_leaf = best['min_samples_leaf'],
                                       min_samples_split = best['min_samples_split'],
                                       n_estimators = est[best['n_estimators']]
                                       ).fit(X_Train, Y_Train)

# Predict on the test set with the tuned model
hopt_pf = trainedforest.predict(X_Test)
hopt_acc = accuracy_score(Y_Test, hopt_pf)
print(hopt_acc)

# Result: 0.9983146659176293
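
  • Instead of maintaining these index dictionaries by hand, hyperopt's space_eval helper can map the index-based best result back to the actual parameter values (a short example, assuming the space and best objects defined above):
from hyperopt import space_eval

# Converts e.g. {'criterion': 1, ...} back into {'criterion': 'gini', ...}
print(space_eval(space, best))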

#5 Optuna

  • Optuna is a widely used framework in Kaggle competitions (perhaps even more than HyperOpt).
  • Optuna is a framework that automates hyperparameter optimization tasks.
  • Advantages:
    • Broad versatility, being compatible with almost all ML/DL frameworks.
    • Simple and fast.
    • Includes various cutting-edge optimization algorithms.
    • Supports parallel processing.
    • Enables visualization.
  • To understand Optuna, we need to be familiar with the following terms:
    • Study: Optimization based on an objective function
    • Trial: Execution of the objective function
  • In simple terms, a Study is the overall optimization process, and Trials are the individual executions of the objective function with different combinations.
  • The purpose of a Study can be said to be finding the optimal hyperparameter combination through multiple trials.
import optuna
import sklearn.datasets
import sklearn.ensemble
import sklearn.svm
from sklearn.model_selection import cross_val_score

# 1. Define the objective function to minimize/maximize
def objective(trial):
    iris = sklearn.datasets.load_iris()
    x, y = iris.data, iris.target

    # 2. Suggest hyperparameter values through the trial object
    # Different classification models can be set up and compared.
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    # When the classifier is SVC
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        classifier_obj = sklearn.svm.SVC(C=svc_c, gamma='auto')

    # When the classifier is a random forest
    else:
        rf_max_depth = int(trial.suggest_loguniform('rf_max_depth', 2, 32))
        classifier_obj = sklearn.ensemble.RandomForestClassifier(max_depth=rf_max_depth, n_estimators=10)

    accuracy = cross_val_score(classifier_obj, x, y, cv=4).mean()
    return accuracy

# 3. Create a study object and optimize the objective function
# The objective here is accuracy, so we maximize it; for a loss function, set direction='minimize'
study = optuna.create_study(direction='maximize')
# Number of trials is set to 200
study.optimize(objective, n_trials=200)
# Best hyperparameter combination among the completed trials
print(study.best_trial.params)

# Best value among the completed trials
optuna_acc = study.best_trial.value
print(optuna_acc)

# Result:
# {'classifier': 'SVC', 'svc_c': 5.8796966956898995}
# 0.9735064011379801

# Optuna's built-in visualization
# Plot of the importance of each hyperparameter
optuna.visualization.plot_param_importances(study)

# Plot of the optimization history
optuna.visualization.plot_optimization_history(study)

 

 

Reference

 

  • Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization - DeepLearning.AI, www.coursera.org
  • Hyperparameter Tuning (studying hyperparameter tuning with Kaggle notebooks) - velog.io
  • Automated Hyperparameter Tuning - Kaggle Notebook, www.kaggle.com

 


Excellence is not a destination; it is a continuous journey that never ends.
- Conor McGregor -