I. Introduction to CatBoost
CatBoost is a machine learning framework based on the gradient boosted decision tree (GBDT) algorithm. It was originally developed by engineers at Yandex, the Russian search engine company; it supports both classification and regression tasks and handles categorical features natively.
Like XGBoost and LightGBM, CatBoost builds an ensemble of gradient boosted trees. Its distinguishing features are ordered boosting, which reduces the target leakage (prediction shift) that plain gradient boosting suffers from, and ordered target statistics for encoding categorical features.
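To make the target-statistics idea concrete, here is a minimal sketch of how a single categorical column can be encoded in the spirit of CatBoost's ordered target statistics: each row sees only the label statistics of the rows that came before it, plus a smoothing prior, so a row never leaks its own label into its encoding. The helper name and the prior value are illustrative, not part of the catboost package.

def ordered_target_statistics(categories, targets, prior=0.5, a=1.0):
    """Encode one categorical column using only 'past' rows.

    Row i becomes (sum of targets of earlier rows with the same category
    + a * prior) / (count of earlier rows with the same category + a).
    Illustrative helper, not CatBoost's actual API.
    """
    sums, counts, encoded = {}, {}, []
    for cat, t in zip(categories, targets):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        encoded.append((s + a * prior) / (n + a))
        sums[cat] = s + t
        counts[cat] = n + 1
    return encoded

# The encoding of "sunny" changes as more labeled rows are seen
print(ordered_target_statistics(
    ["sunny", "rain", "sunny", "sunny"], [1, 0, 1, 0]))
# -> [0.5, 0.5, 0.75, 0.8333...]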
II. Comparing CatBoost with XGBoost and LightGBM
1. Training speed
Before CatBoost was released, XGBoost and LightGBM were the most widely used gradient boosting frameworks. CatBoost is competitive with both in training speed, and it tends to have the advantage when many features are non-numeric (categorical). The benchmark below trains all three libraries on the same synthetic dataset and compares wall-clock times.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
import time

# Synthetic binary classification dataset shared by all three benchmarks
X, y = make_classification(n_samples=100000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# CatBoost
start_time = time.time()
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, verbose=False)
model.fit(X_train, y_train)
print(f"Time taken by CatBoost : {time.time()-start_time:.2f} seconds")

# LightGBM
start_time = time.time()
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6, num_leaves=31)
model.fit(X_train, y_train)
print(f"Time taken by LightGBM : {time.time()-start_time:.2f} seconds")

# XGBoost (native DMatrix API)
start_time = time.time()
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
model = xgb.train(param, dtrain, num_round)
print(f"Time taken by XGBoost : {time.time()-start_time:.2f} seconds")
2. Handling overfitting
Overfitting is a common problem in machine learning: a model that fits the training data too closely generalizes poorly to new data. XGBoost and LightGBM both rely on manually configured countermeasures such as early stopping and regularization. CatBoost supports those too, and additionally has built-in randomization for this purpose: the random_strength parameter adds noise to split scores while trees are grown, which acts as a regularizer. The snippets below show a typical anti-overfitting setup for each library.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import lightgbm as lgb
import xgboost as xgb

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# XGBoost: early stopping on a validation set
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
param = {'max_depth': 6, 'eta': 0.1, 'objective': 'binary:logistic'}
num_round = 1000
evallist = [(dtest, 'eval'), (dtrain, 'train')]
model = xgb.train(param, dtrain, num_round, evals=evallist, early_stopping_rounds=10)

# LightGBM: L1/L2 regularization plus early stopping
model = LGBMClassifier(n_estimators=1000, learning_rate=0.1, max_depth=6,
                       num_leaves=31, objective='binary', reg_alpha=1, reg_lambda=1)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)],
          callbacks=[lgb.early_stopping(10)])

# CatBoost: randomized split scoring (random_strength) and L2 regularization,
# keeping the iteration that scored best on the eval set
model = CatBoostClassifier(iterations=1000, loss_function='Logloss',
                           eval_metric='Logloss', random_strength=0.1,
                           l2_leaf_reg=4)
model.fit(X_train, y_train, eval_set=(X_test, y_test),
          use_best_model=True, verbose=False)
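CatBoost also ships an overfitting detector of its own, so early stopping does not have to be wired up by hand. A minimal sketch reusing X_train/X_test from above (the threshold of 10 rounds is illustrative):

# Stop training once the eval metric has not improved for 10 iterations
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6)
model.fit(X_train, y_train, eval_set=(X_test, y_test),
          early_stopping_rounds=10, verbose=False)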
3. Handling categorical features
CatBoost handles categorical features out of the box. XGBoost historically required all inputs to be numeric, so categorical columns had to be encoded by hand before training, which often hurt results on category-heavy data. CatBoost instead applies its ordered target statistics internally to any column listed in cat_features. The same encoding is also available as a standalone preprocessing step via CatBoostEncoder, which lives in the category_encoders package rather than in catboost itself. The code below demonstrates both routes.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from category_encoders import CatBoostEncoder  # not part of catboost itself

X, y = make_classification(n_samples=100000, n_features=200, random_state=42)
# make_classification produces continuous values; discretize the first
# twenty columns so they can play the role of categorical features
X = pd.DataFrame(X)
cat_cols = list(range(20))
X[cat_cols] = (X[cat_cols] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Native handling: just tell CatBoost which columns are categorical
model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6,
                           cat_features=cat_cols)
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)

# Equivalent standalone encoding as a preprocessing step
encoder = CatBoostEncoder(cols=cat_cols)
encoder.fit(X_train, y_train)
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
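In real datasets the categorical columns are usually strings, and those can be passed straight through as well, for example via catboost's Pool wrapper. A minimal sketch with a toy DataFrame (the column names and values are made up for illustration):

import pandas as pd
from catboost import CatBoostClassifier, Pool

# Toy frame with one string-valued categorical column and one numeric column
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'] * 25,
                   'size': [1.0, 2.0, 3.0, 4.0] * 25})
labels = [0, 1, 0, 1] * 25

# Categorical columns are named directly; no manual encoding required
train_pool = Pool(df, label=labels, cat_features=['color'])
model = CatBoostClassifier(iterations=50, depth=2, verbose=False)
model.fit(train_pool)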
4. Improvements over AdaBoost
AdaBoost is a classic boosting algorithm that combines many weak learners of a single type, typically very shallow trees, by reweighting the training samples after each round. Gradient boosting, the family CatBoost belongs to, generalizes this idea: each new tree is fit to the gradient of the loss rather than to reweighted samples, and a learning rate scales every tree's contribution to keep the ensemble balanced. The comparison below trains scikit-learn's AdaBoostClassifier and a CatBoost model with comparably shallow trees on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier

X, y = make_classification(n_samples=100000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# AdaBoost with its default shallow-tree base learner
adaboost_clf = AdaBoostClassifier(random_state=42)
adaboost_clf.fit(X_train, y_train)
print(f"AdaBoost accuracy: {adaboost_clf.score(X_test, y_test):.4f}")

# CatBoost with comparably shallow trees and an explicit learning rate
model = CatBoostClassifier(loss_function='Logloss', iterations=100,
                           random_strength=0.1, max_depth=2, learning_rate=0.1)
model.fit(X_train, y_train, eval_set=(X_test, y_test), verbose=False)
print(f"CatBoost accuracy: {model.score(X_test, y_test):.4f}")
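To make the learning-rate mechanism concrete, here is a minimal self-contained sketch of the gradient-boosting update that CatBoost builds on, using squared loss for simplicity so the negative gradient is just the residual. All values are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=1000)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from the mean prediction
for _ in range(100):
    residual = y - prediction            # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    # the learning rate shrinks each tree's contribution to the ensemble
    prediction += learning_rate * tree.predict(X)

print("training MSE:", np.mean((y - prediction) ** 2))

A smaller learning rate means each tree corrects less of the remaining error, so more trees are needed, but the ensemble is less likely to overfit any single tree's mistakes.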