Geek Time (极客时间): Easy Learning, Efficient Learning - Geekbang

在路上
2021-09-22
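Hi Jia-ge, cross-validation produces multiple models. When predicting on new data, do you add up the predictions of those models and take the average?
Author's reply: Cross-validation is generally not used to train a model for prediction. Instead, it computes an evaluation score several times and averages those scores to determine how well a model (or algorithm) performs. If you borrow the idea to obtain multiple models and then average their predictions, that is "ensemble learning" and no longer counts as cross-validation.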
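To make the distinction in the reply concrete, here is a minimal sketch that contrasts the two ideas. It assumes synthetic data from `make_regression` and a plain `LinearRegression` as stand-ins for the course dataset and model; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for the course dataset (an assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validation: one score per fold, averaged to judge the algorithm.
cv_scores = cross_val_score(LinearRegression(), X_train, y_train, cv=kf, scoring="r2")
mean_cv_r2 = cv_scores.mean()
print(f"Mean CV R2: {mean_cv_r2:.4f}")

# Ensemble-style averaging: keep each fold's model and average their predictions.
fold_models = [
    LinearRegression().fit(X_train[train_idx], y_train[train_idx])
    for train_idx, _ in kf.split(X_train)
]
avg_prediction = np.mean([m.predict(X_new) for m in fold_models], axis=0)
print(avg_prediction[:3])
```

The first half only produces scores and discards the fitted models; the second half keeps the models and combines their outputs, which is the ensemble idea the reply refers to.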
Siyige2727
2021-09-22
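Hello teacher, I ran the search with the settings below. Why is the score on the validation set so low?

    GridSearchCV(cv=3, estimator=RandomForestRegressor(), n_jobs=10,
                 param_grid={'bootstrap': [True, False],
                             'criterion': ['mse', 'mae'],
                             'max_depth': [3, 5, 6, 10, 12, None],
                             'max_features': ['auto', 'sqrt'],
                             'min_samples_leaf': [2, 3, 5, 7, 10],
                             'min_samples_split': [2, 5, 8, 10],
                             'n_estimators': [50]},
                 scoring='r2', verbose=1)

R2 score on the training set, tuned random forest: 0.8541
R2 score on the test set, tuned random forest: 0.0481
Author's reply: That can happen. Our dataset is fairly small, so each split into training and test sets can turn out very differently. Random forest is also a model with built-in randomness, so its results differ from run to run. Try changing the seed used when splitting the data (random_state) and rerun it to see whether the score is still this low. You can also try different combinations of parameter values (search online for suitable ones); the values I gave are not necessarily the best.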
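The suggestion to vary `random_state` can be tried with a short loop like the one below. This is only a sketch: it assumes a small synthetic dataset from `make_regression` in place of the course data, and a fixed `n_estimators=50` forest rather than the tuned model from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Small synthetic dataset (an assumption) to mimic the "few samples" situation.
X, y = make_regression(n_samples=60, n_features=8, noise=30.0, random_state=42)

test_scores = []
for seed in range(5):
    # A different seed gives a different train/test split and a different forest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    test_scores.append(score)
    print(f"random_state={seed}: test R2 = {score:.4f}")

print(f"Spread across seeds: {max(test_scores) - min(test_scores):.4f}")
```

If the test score swings widely from seed to seed, a single low number says more about the split than about the parameters.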
Matthew
2023-06-04, from Jiangsu
Code for Homework 3:

    # Create the model
    from sklearn.ensemble import RandomForestRegressor
    model_rfr = RandomForestRegressor()

    # Parameter grid for tuning the random forest.
    # 'auto' was removed from max_features in scikit-learn 1.3; for regressors
    # it meant "use all features", so 1.0 is used here instead.
    rfr_param_grid = {'bootstrap': [True, False],
                      'max_depth': [10, 50, 100, None],
                      'max_features': [1.0, 'sqrt'],
                      'min_samples_leaf': [1, 2, 4],
                      'min_samples_split': [2, 5, 10],
                      'n_estimators': [50, 500, 2000]}

    # Grid search (X_train, y_train, X_test, y_test come from the lesson's split)
    from sklearn.model_selection import GridSearchCV
    model_rfr_gs = GridSearchCV(model_rfr, param_grid=rfr_param_grid, cv=3,
                                scoring="r2", n_jobs=10, verbose=1)
    model_rfr_gs.fit(X_train, y_train)  # search, then refit on the training set
    print("GridSearchCV best parameters:", model_rfr_gs.best_params_)

    # Randomized search over the same grid
    from sklearn.model_selection import RandomizedSearchCV
    model_rfr_rs = RandomizedSearchCV(model_rfr, param_distributions=rfr_param_grid,
                                      cv=3, n_iter=10, scoring="r2", n_jobs=10, verbose=1)
    model_rfr_rs.fit(X_train, y_train)  # search, then refit on the training set
    print("RandomizedSearchCV best parameters:", model_rfr_rs.best_params_)

    # Evaluate both tuned models
    from sklearn.metrics import r2_score
    print("GridSearchCV:")
    print('R2 score on the training set, tuned random forest: %0.4f' % r2_score(y_train, model_rfr_gs.predict(X_train)))
    print('R2 score on the test set, tuned random forest: %0.4f' % r2_score(y_test, model_rfr_gs.predict(X_test)))
    print("RandomizedSearchCV:")
    print('R2 score on the training set, tuned random forest: %0.4f' % r2_score(y_train, model_rfr_rs.predict(X_train)))
    print('R2 score on the test set, tuned random forest: %0.4f' % r2_score(y_test, model_rfr_rs.predict(X_test)))
Author's reply: OK.
Matthew
2023-06-03, from Jiangsu
    # Continues the Homework 2 code: model_lr, df_LTV and r2_score are defined there.
    from sklearn.model_selection import LeaveOneOut, LeavePOut

    # Leave-one-out splitting
    loo = LeaveOneOut()
    for fold_, (train_index, test_index) in enumerate(loo.split(df_LTV)):
        X_train = df_LTV.iloc[train_index].drop(['年度LTV'], axis=1)  # training X
        X_test = df_LTV.iloc[test_index].drop(['年度LTV'], axis=1)  # validation X
        y_train = df_LTV.iloc[train_index]['年度LTV']  # training y
        y_test = df_LTV.iloc[test_index]['年度LTV']  # validation y (.iloc: indices are positional)
        model_lr.fit(X_train, y_train)  # train the model
        # print(f"Fold {fold_} validation R2 score: {r2_score(y_test, model_lr.predict(X_test))}")
        print(f"Fold {fold_} true value: {y_test.values[0]}, predicted value: {model_lr.predict(X_test)[0]}")

    # Leave-p-out splitting (note: this enumerates every subset of 10 rows)
    lpo = LeavePOut(p=10)
    for fold_, (train_index, test_index) in enumerate(lpo.split(df_LTV)):
        X_train = df_LTV.iloc[train_index].drop(['年度LTV'], axis=1)  # training X
        X_test = df_LTV.iloc[test_index].drop(['年度LTV'], axis=1)  # validation X
        y_train = df_LTV.iloc[train_index]['年度LTV']  # training y
        y_test = df_LTV.iloc[test_index]['年度LTV']  # validation y (.iloc: indices are positional)
        model_lr.fit(X_train, y_train)  # train the model
        print(f"Fold {fold_} validation R2 score: {r2_score(y_test, model_lr.predict(X_test))}")
Author's reply: Good.
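Two details of the leave-one-out and leave-p-out loops may be worth a small illustration. R2 is not defined on a single held-out sample, which is presumably why the per-fold R2 line is commented out, and leave-p-out produces one split per size-p subset. The sketch below assumes synthetic data and uses `cross_val_predict` as one possible way to score leave-one-out; it is not taken from the course code.

```python
from math import comb

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_predict

# Synthetic data standing in for df_LTV (an assumption).
X, y = make_regression(n_samples=30, n_features=3, noise=5.0, random_state=0)

# Collect the held-out prediction for every sample, then score them together,
# because R2 cannot be computed from one sample at a time.
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
loo_r2 = r2_score(y, loo_pred)
print(f"Leave-one-out R2 over all held-out predictions: {loo_r2:.4f}")

# LeavePOut enumerates every subset of p samples, i.e. C(n, p) splits.
n_lpo_splits = LeavePOut(p=10).get_n_splits(X)
print(n_lpo_splits, comb(len(X), 10))
```

Even with only 30 rows, the number of leave-10-out splits is in the tens of millions, so a much smaller p or a repeated K-fold is usually the more practical choice.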
Matthew
2023-06-03, from Jiangsu
Code for Homework 2:

    # Create the model
    from sklearn.linear_model import LinearRegression
    model_lr = LinearRegression()

    # Import the splitters
    from sklearn.model_selection import KFold, RepeatedKFold, LeaveOneOut, LeavePOut
    # Import the R2 score metric
    from sklearn.metrics import r2_score

    # Plain KFold (df_LTV is the DataFrame from the lesson)
    kf5 = KFold(n_splits=5, shuffle=False)  # 5 folds
    for fold_, (train_index, test_index) in enumerate(kf5.split(df_LTV)):
        # print(train_index, test_index)
        X_train = df_LTV.iloc[train_index].drop(['年度LTV'], axis=1)  # training X
        X_test = df_LTV.iloc[test_index].drop(['年度LTV'], axis=1)  # validation X
        y_train = df_LTV.iloc[train_index]['年度LTV']  # training y
        y_test = df_LTV.iloc[test_index]['年度LTV']  # validation y (.iloc: indices are positional)
        model_lr.fit(X_train, y_train)  # train the model
        print(f"Fold {fold_} validation R2 score: {r2_score(y_test, model_lr.predict(X_test))}")

    # Repeated K-fold
    rkf5 = RepeatedKFold(n_splits=5, n_repeats=10)  # 5 folds, repeated 10 times
    for fold_, (train_index, test_index) in enumerate(rkf5.split(df_LTV)):
        X_train = df_LTV.iloc[train_index].drop(['年度LTV'], axis=1)  # training X
        X_test = df_LTV.iloc[test_index].drop(['年度LTV'], axis=1)  # validation X
        y_train = df_LTV.iloc[train_index]['年度LTV']  # training y
        y_test = df_LTV.iloc[test_index]['年度LTV']  # validation y (.iloc: indices are positional)
        model_lr.fit(X_train, y_train)  # train the model
        print(f"Fold {fold_} validation R2 score: {r2_score(y_test, model_lr.predict(X_test))}")
Author's reply: OK.
Matthew
2023-06-02, from Jiangsu
Code and results for Homework 1:

    # Create the model
    from sklearn.linear_model import Lasso
    model_lasso = Lasso()  # Lasso regression model

    # Cross-validation (X and y are the features and label from the lesson)
    from sklearn.model_selection import cross_validate
    cv = 3
    scores = cross_validate(model_lasso, X, y, cv=cv,
                            scoring=('r2', 'neg_mean_squared_error'),
                            return_train_score=True)
    for i in range(cv):
        print(f"Fold {i+1} fit_time: {scores['fit_time'][i]}")
        print(f"Fold {i+1} score_time: {scores['score_time'][i]}")
        print(f"Fold {i+1} test_r2: {scores['test_r2'][i]}")
        print(f"Fold {i+1} train_r2: {scores['train_r2'][i]}")
        # The sign is flipped, so the printed values are plain MSE, not negated MSE.
        print(f"Fold {i+1} test_mse: {-scores['test_neg_mean_squared_error'][i]}")
        print(f"Fold {i+1} train_mse: {-scores['train_neg_mean_squared_error'][i]}")
        print("\n")

--------------------------------------------------------------------

    Fold 1 fit_time: 0.000865936279296875
    Fold 1 score_time: 0.0006310939788818359
    Fold 1 test_r2: 0.5509442176019284
    Fold 1 train_r2: 0.49295429721273365
    Fold 1 test_mse: 36374278.87442617
    Fold 1 train_mse: 3931052.5016323677

    Fold 2 fit_time: 0.0009672641754150391
    Fold 2 score_time: 0.0004417896270751953
    Fold 2 test_r2: -0.547095575701684
    Fold 2 train_r2: 0.756077640106148
    Fold 2 test_mse: 19259422.630117264
    Fold 2 train_mse: 10686185.198631434

    Fold 3 fit_time: 0.000530242919921875
    Fold 3 score_time: 0.0004150867462158203
    Fold 3 test_r2: 0.0975132893938776
    Fold 3 train_r2: 0.6571299228574952
    Fold 3 test_mse: 2236906.201517424
    Fold 3 train_mse: 16312807.147477873
Author's reply: ✌️