Geek Time (极客时间): easy, efficient learning, by Geekbang (极客邦)

柚子

2019-03-13

Teacher, when I run GridSearchCV several times, the result is different each time. How do we decide that the best score is 0.9667 and the best parameter is 6?
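
One common reason a grid search gives different answers on each run is unseeded randomness in the model or in the cross-validation split. The snippet below is a minimal sketch of that idea, not the course's code: the estimator (RandomForestClassifier), the iris data, the parameter range, and the helper name run_search are all assumptions, since the question does not show what was run. With random_state fixed in both places, two runs report the same best parameter and best score.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)


def run_search(seed):
    """Hypothetical helper: one grid search with all randomness seeded."""
    # Seed the model and the CV splitter, the two random parts of this setup.
    model = RandomForestClassifier(random_state=seed)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        model,
        param_grid={"n_estimators": [2, 4, 6, 8, 10]},
        scoring="accuracy",
        cv=cv,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_


first = run_search(seed=1)
second = run_search(seed=1)
print(first)
print(second)
```

The "best" values are only best relative to the chosen grid, data split, and seed, so small differences between runs are expected when nothing is seeded.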



 17
vortual

2019-03-13

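Teacher, in real work with large data volumes, training a model takes a long time, so surely we cannot try every parameter and this many algorithms? Another question: once the data exceeds a certain size, do we have to use deep learning? I hope you can clear this up.

Author reply: Deep learning helps you discover patterns that are not obvious (ones that are hard to capture with linear methods). You can often start with classical machine learning to get a baseline and then consider a neural network model. Deep models are computationally expensive and sometimes need a GPU.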
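
A minimal sketch of one way to get a cheap baseline when an exhaustive search is too slow: tune on a subsample and cap the number of candidates with RandomizedSearchCV. This is an illustration under stated assumptions, not something the reply prescribes. The synthetic data, the 20% subsample, the decision tree, and the parameter ranges are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a larger dataset (assumption for illustration).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Tune on a 20% stratified subsample to keep each fit cheap.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": list(range(2, 12)),
        "min_samples_leaf": [1, 5, 10, 20, 50],
    },
    n_iter=8,        # try 8 sampled combinations instead of all 50
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_small, y_small)
print(search.best_params_, round(search.best_score_, 4))
```

The trade-off is that a sampled search may miss the best combination in the grid, in exchange for a fixed, predictable number of fits.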



 9
西湖晨曦

2019-09-03

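I work in a bank's credit card department, and I also enjoy data analysis.

After reading this case, though, I wonder what it can actually bring to credit card data analysis. I mean, what problems can it uncover? At what stage of a cardholder's card usage should the credit card department start taking measures to prevent fraud? What types of customers are prone to fraud? The case feels like it goes from numbers to numbers, without helping the real business.

I would also like to remind people who work in data analysis: data analysis is not purely academic research that goes from numbers to numbers. It should connect to actual work and help it. For this case, it should improve the credit card department's anti-fraud work. What has the analysis found? Which part of the bank's process should be improved to prevent fraud?

Author reply: We use data analysis for classification and prediction. That is, when we encounter data for other users, we run it through the model to classify them, and the model's accuracy tells us how reliably we can predict whether a user is fraudulent.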

 1

 6
跳跳

2019-03-13

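1. My understanding of GridSearchCV: based on prior experience, you pick a set of promising candidate values, try each of them, and keep the one that performs best. Compared with choosing a parameter directly, this gives more assurance, but it also adds computational cost.
2. On top of the teacher's code I added AdaBoost classification with AdaBoost's default base classifier. The best performance came at n_estimators=10, with a best score of 0.8187.
GridSearch best parameters: {'AdaBoostClassifier__n_estimators': 10}
GridSearch best score: 0.8187
Accuracy 0.8129

Author reply: Right. You give GridSearchCV a range and it finds the best value by exhaustive search.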
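
To make the "exhaustive search" point concrete, the sketch below uses scikit-learn's ParameterGrid to list every combination a grid would try. The n_estimators values come from the comments in this thread; the learning_rate values and the 5-fold setting are assumptions added only to show how the candidate count multiplies.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [10, 50, 100],
    "learning_rate": [0.5, 1.0],  # hypothetical second parameter
}

# Every combination in the grid is a candidate that gets fitted and scored.
candidates = list(ParameterGrid(param_grid))
for params in candidates:
    print(params)

n_folds = 5  # assumed number of cross-validation folds
n_fits = len(candidates) * n_folds
print("candidates:", len(candidates), "model fits:", n_fits)
```

Each extra parameter multiplies the number of candidates, which is where the added computational cost mentioned above comes from.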



 4
third

2019-03-20

Question: I keep getting FutureWarning. What is going on?

GridSearch best parameters: {'n_estimators': 10}
GridSearch best score: 0.8187
Accuracy 0.8129

Author reply: You can suppress the warning.
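
A FutureWarning signals that a library's behavior or default will change in a later release, so it is worth reading once before silencing it. The sketch below shows suppression with Python's standard warnings module. The function noisy_fit is a hypothetical stand-in for whichever library call emits the warning, since the question does not include the warning text.

```python
import warnings


def noisy_fit():
    # Hypothetical stand-in for a library call that emits a FutureWarning.
    warnings.warn("a default value will change in a future release", FutureWarning)
    return "fitted"


# Without an ignore filter, the warning is reported.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_fit()
shown = len(caught)

# With an ignore filter for FutureWarning, it is suppressed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_fit()
suppressed = len(caught)

print(shown, suppressed)
```

In a script, a single call to warnings.simplefilter(action="ignore", category=FutureWarning) near the top has the same effect for the whole run.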

 1

 3
王彬成

2019-03-13

GridSearch best parameters: {'n_estimators': 10}
GridSearch best score: 0.8187
Accuracy 0.8129
----- Code -----

# -*- coding: utf-8 -*-
# Credit card default analysis
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier

from matplotlib import pyplot as plt
import seaborn as sns
# Load the data
data = pd.read_csv('./credit_default-master/UCI_Credit_Card.csv')
# Explore the data
print(data.shape)  # dataset size
print(data.describe())  # summary statistics
# Count defaults and non-defaults for the next month
next_month = data['default.payment.next.month'].value_counts()
print(next_month)
df = pd.DataFrame({'default.payment.next.month': next_month.index, 'values': next_month.values})
plt.figure(figsize=(6, 6))
plt.title('Credit card default clients\n (default: 1, no default: 0)')
sns.set_color_codes("pastel")
sns.barplot(x='default.payment.next.month', y="values", data=df)
plt.show()
# Feature selection: drop the ID column and split off the label column
data.drop(['ID'], inplace=True, axis=1)  # the ID column carries no information
target = data['default.payment.next.month'].values
columns = data.columns.tolist()
columns.remove('default.payment.next.month')
features = data[columns].values
# Hold out 30% as the test set and train on the rest
train_x, test_x, train_y, test_y = train_test_split(features, target, test_size=0.30, stratify=target, random_state=1)

# Classifier
ada = AdaBoostClassifier(random_state=1)
# Parameters to tune
parameters = {'n_estimators': [10, 50, 100]}

# Tune the parameters with GridSearchCV
clf = GridSearchCV(estimator=ada, param_grid=parameters, scoring='accuracy')

clf.fit(train_x, train_y)
print("GridSearch best parameters:", clf.best_params_)
print("GridSearch best score: %0.4f" % clf.best_score_)
predict_y = clf.predict(test_x)
print("Accuracy %0.4f" % accuracy_score(test_y, predict_y))

Author reply: Good job



 3
王彬成

2019-03-13

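Advantages of the Pipeline mechanism, reference material:
https://www.jianshu.com/p/9c2c8c8ef42d
https://blog.csdn.net/qq_41598851/article/details/80957893

My own understanding:
Pipeline factors out the shared parts of the data-processing workflow and simplifies the code.
Taking the final programming example in this article, the shared parts are "data normalization" and "applying a classification algorithm", and the two are wrapped together.
On each iteration over the algorithms, the algorithm inside the pipeline is swapped. GridSearchCV is always handed a pipeline of the same shape, but the algorithm inside it has changed. This avoids writing redundant code.

Author reply: Thanks for sharing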
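
The loop described in this comment can be sketched as follows. This is a simplified illustration rather than the course's code: it uses the iris data, only two classifiers, and small hypothetical grids so that it runs quickly. The step names follow the lowercase class-name convention seen in the outputs in this thread.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# (step name, estimator, parameter grid) for each algorithm to try
candidates = [
    ("decisiontreeclassifier", DecisionTreeClassifier(random_state=1),
     {"decisiontreeclassifier__max_depth": [2, 4, 6]}),
    ("kneighborsclassifier", KNeighborsClassifier(),
     {"kneighborsclassifier__n_neighbors": [4, 6, 8]}),
]

results = {}
for name, model, grid in candidates:
    # The scaling step is shared; only the final step changes per iteration.
    pipeline = Pipeline([("scaler", StandardScaler()), (name, model)])
    search = GridSearchCV(pipeline, grid, scoring="accuracy", cv=5)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(name, params, round(score, 4))
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every cross-validation fold, so the held-out fold does not influence the scaling.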



 2
JingZ

2019-03-13

# Credit card default analysis
KNN was comparatively the slowest to run.

from sklearn.ensemble import AdaBoostClassifier

# Classifier added to the list of classifiers
AdaBoostClassifier(random_state=1)

# Classifier name
'adaboostclassifier'

# Classifier parameters
{'adaboostclassifier__n_estimators': [10, 50, 100]}

Result:

GridSearch best score: 0.8187
GridSearch best parameters: {'adaboostclassifier__n_estimators': 10}
Accuracy 0.8129



 2
王彬成

2019-03-13

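How should I understand "GridSearch best score" and "accuracy on the prediction data"? Is my understanding below correct?

As I understand it, "GridSearch best score" is the best accuracy obtained from the [training data].
"Accuracy on the prediction data" is the accuracy obtained when the best model is applied to the [test data].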
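
The distinction can be checked directly. In scikit-learn, best_score_ is the mean cross-validated score of the best candidate, computed on held-out folds carved out of the training data, while the final accuracy is measured on the separate test set. The sketch below uses the iris data and a decision tree as assumptions to keep it small. The split mirrors the 30% test size used in this thread.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid={"max_depth": [2, 4, 6]},
    scoring="accuracy",
    cv=5,
)
search.fit(train_x, train_y)  # the test set is never seen here

# best_score_ is the mean validation-fold accuracy of the winning candidate.
cv_mean = search.cv_results_["mean_test_score"][search.best_index_]
# Test accuracy comes from the refitted best model on the held-out test set.
test_acc = accuracy_score(test_y, search.predict(test_x))

print("best_score_:", search.best_score_)
print("mean CV score of best candidate:", cv_mean)
print("test accuracy:", test_acc)
```

So the understanding above is close: the best score does come from the training data, but from the validation folds within it rather than from the rows each model was fitted on.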



 1
白夜

2019-03-13

Thirty thousand rows with 25 fields already take several minutes to compute; what about hundreds of millions of rows...
'''
GridSearch best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
GridSearch best score: 0.8174
Accuracy 0.8172
68.59484457969666 s
GridSearch best parameters: {'decisiontreeclassifier__max_depth': 6}
GridSearch best score: 0.8186
Accuracy 0.8113
1.8460278511047363 s
GridSearch best parameters: {'randomforestclassifier__n_estimators': 6}
GridSearch best score: 0.7998
Accuracy 0.7994
2.297856330871582 s
GridSearch best parameters: {'kneighborsclassifier__n_neighbors': 8}
GridSearch best score: 0.8040
Accuracy 0.8036
154.36387968063354 s
GridSearch best parameters: {'adaboostclassifier__n_estimators': 10}
GridSearch best score: 0.8187
Accuracy 0.8129
13.483576774597168 s
'''

Author reply: Good job



 1
佳佳的爸

2020-01-10

The program fails with an error. The environment is Python 3.7.

Traceback (most recent call last):
  File "E:/my_work_spaces/pycharm/Self_learn_projs/Crawler_projs/Data_ana_lesson_src/Credit_card/Creditcard_fraud.py", line 71, in <module>
    ax = sns.countplot(x='Class', data=data)
  File "E:\my_work_spaces\pycharm\venv_py37\lib\site-packages\seaborn\categorical.py", line 3553, in countplot
    errcolor, errwidth, capsize, dodge)
  File "E:\my_work_spaces\pycharm\venv_py37\lib\site-packages\seaborn\categorical.py", line 1607, in __init__
    order, hue_order, units)
  File "E:\my_work_spaces\pycharm\venv_py37\lib\site-packages\seaborn\categorical.py", line 155, in establish_variables
    raise ValueError(err)
ValueError: Could not interpret input 'Class'
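
A ValueError of this form from seaborn typically means that the column name passed as x is not present in data. A quick check, sketched below, is to look at the column names before plotting. The tiny DataFrame is a hypothetical stand-in, since the real CSV is not shown, and the fallback name default.payment.next.month is the label column used elsewhere in this thread, which is only an assumption about which file was loaded.

```python
import pandas as pd

# Hypothetical stand-in for the loaded CSV (values are made up).
data = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "default.payment.next.month": [0, 1, 0, 0],
})

label = "Class"
print(data.columns.tolist())  # inspect what the file actually contains

if label not in data.columns:
    # Assumed fallback: the label column of the credit default dataset.
    label = "default.payment.next.month"

counts = data[label].value_counts()
print(counts)
```

If the printed column list does not contain the expected label at all, the script may be reading a different CSV than the one the lesson uses.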





Ronnyz

2019-11-30

# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from warnings import simplefilter

# Suppress FutureWarning messages
simplefilter(action='ignore', category=FutureWarning)

# Load the data
credits = pd.read_csv('CreditCard_data/UCI_Credit_Card.csv')

# Explore the data
print(credits.shape)
print(credits.describe())  # summary statistics

# Feature selection: drop the ID column
credits.drop(['ID'], inplace=True, axis=1)
target = credits['default.payment.next.month'].values
columns = credits.columns.tolist()
columns.remove('default.payment.next.month')
features = credits[columns].values

# Split the data, holding out 30% as the test set
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=666)

# Build the classifier
ada = AdaBoostClassifier()

# Parameter grid for the grid search
parameters = {
    'n_estimators': [10, 50, 100]
}

gscv = GridSearchCV(estimator=ada, param_grid=parameters, scoring='accuracy', n_jobs=-1)
gscv.fit(X_train, y_train)
print('GridSearch best parameters:', gscv.best_params_)
print('GridSearch best score: %0.4f' % gscv.best_score_)
y_pred = gscv.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

GridSearch best parameters: {'n_estimators': 50}
GridSearch best score: 0.8197
Accuracy: 0.8121111111111111

Author reply: Nicely written, Ronnyz




一纸书

2019-11-22

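I can just about follow it, but I know that if I had to complete this project entirely on my own, starting from a blank Python file, I could not do it.

Author reply: Take it slowly. Everyone builds this up gradually. Keep going!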




孟君

2019-11-07

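Teacher, this is clearly an unbalanced dataset. Does it need to be balanced first? When I worked on the Lending Club dataset, I found that after converting it to a balanced dataset, the accuracy of random forest went up considerably.

Author reply: Yes, you can balance it first. That tends to work better.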
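
A minimal sketch of one balancing approach, random oversampling of the minority class with sklearn.utils.resample. The synthetic data and its roughly 78/22 class split are assumptions for illustration. The important detail is that only the training split is resampled, so the test set keeps the original class distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (assumed stand-in, about 78% class 0).
X, y = make_classification(n_samples=2000, weights=[0.78, 0.22], random_state=1)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

minority = train_y == 1
n_major = int((~minority).sum())

# Draw minority rows with replacement until both classes have equal counts.
up_x, up_y = resample(
    train_x[minority], train_y[minority],
    replace=True, n_samples=n_major, random_state=1,
)
bal_x = np.vstack([train_x[~minority], up_x])
bal_y = np.concatenate([train_y[~minority], up_y])

print("before:", np.bincount(train_y), "after:", np.bincount(bal_y))
```

An alternative that avoids resampling is the class_weight="balanced" option that some scikit-learn classifiers, such as RandomForestClassifier, accept. On imbalanced data it also helps to report recall or F1 alongside accuracy.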




许智鸿

2019-11-04

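Teacher, is there any basis for how the parameter ranges are chosen? Or, the next time I meet a similar problem, such as predicting the probability that a user clicks an ad, can I reuse this code and these parameter ranges directly and get the best classifier and parameter values?

Also, if there are too many fields, do I need dimensionality reduction? How should it be done?

Author reply: In any competition, feature engineering matters a great deal in addition to the algorithm and model. Whether to reduce dimensionality depends on the business scenario. In competitions you sometimes do not need to reduce dimensionality, and sometimes you instead need to construct new features, with the goal of improving the result. In engineering work, dimensionality reduction is an option, since fewer dimensions make computation more efficient.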
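
If dimensionality reduction is wanted, one option is to add PCA as a pipeline step and let the grid search choose the number of components. The sketch below assumes the wine dataset (13 features), a KNN classifier, and the candidate values 2, 5, and 10. None of these come from the course.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # 13 numeric features

pipeline = Pipeline([
    ("scaler", StandardScaler()),   # PCA is sensitive to feature scale
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Treat the number of retained components as one more parameter to tune.
search = GridSearchCV(
    pipeline,
    param_grid={"pca__n_components": [2, 5, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Tuning the component count this way lets the data show whether fewer dimensions cost accuracy, rather than fixing the number in advance.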




滨滨

2019-04-28

GridSearchCV is essentially an exhaustive search.

Author reply: Yes


滢

2019-04-24

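Teacher Yang, I have a few questions. 1. Why is the best score always 0.9667 over multiple runs, while the best parameter, n_estimators, is different each time? 2. Is random forest the counterpart of AdaBoost, in that both are ensemble methods, one being the voting style of ensemble and the other the learning style? Is that the right way to understand it?

Author reply: Good insight. Random forest and AdaBoost are both ensemble learning methods. Random Forest uses Bagging and AdaBoost uses Boosting. Bagging is the voting model.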
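
On the first question, one possible explanation, offered as a sketch rather than a diagnosis: several n_estimators values can reach the same mean cross-validation score, and with an unseeded forest the set of tied candidates can change from run to run. The snippet below lists every candidate that shares rank 1 in cv_results_. The iris data, the 1 to 10 range, and the fixed seed are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": list(range(1, 11))},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

results = search.cv_results_
# Show the mean score of every candidate, not only the winner.
for params, mean, rank in zip(
    results["params"], results["mean_test_score"], results["rank_test_score"]
):
    print(params["n_estimators"], round(mean, 4), rank)

# Candidates that share the top rank have identical mean scores.
tied = [
    params["n_estimators"]
    for params, rank in zip(results["params"], results["rank_test_score"])
    if rank == 1
]
print("tied for best:", tied, "reported:", search.best_params_)
```

When more than one value appears in the tied list, the reported best parameter is only one of several equally scoring choices, which would explain a stable best score next to a changing best parameter.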




Geek_dancer

2019-03-30

I found that the keys of classifier_param_grid must strictly follow the format 'classifier_names' + '__' + 'classifier_param_name'.
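
This naming rule comes from how scikit-learn addresses nested parameters: a parameter of a pipeline step is referred to as the step name, a double underscore, then the parameter name. The sketch below checks which keys a pipeline accepts. The step names here mirror those used in this thread, and the pipeline itself is a minimal assumed example.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("adaboostclassifier", AdaBoostClassifier()),
])

# get_params() lists every key that a param_grid may use.
valid = "adaboostclassifier__n_estimators" in pipeline.get_params()

# A bare parameter name is not a parameter of the Pipeline itself.
try:
    pipeline.set_params(n_estimators=10)
    rejected = False
except ValueError:
    rejected = True

# make_pipeline derives step names from the lowercased class names.
auto_name = make_pipeline(StandardScaler(), AdaBoostClassifier()).steps[-1][0]

print(valid, rejected, auto_name)
```

Printing sorted(pipeline.get_params().keys()) is a quick way to find the exact key to put in the grid.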




Destroy、

2019-03-18

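from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# train_x, train_y, test_x, test_y come from the earlier train_test_split step
ada = AdaBoostClassifier()
parameters = {'adaboostclassifier__n_estimators': [10, 50, 100]}
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('adaboostclassifier', ada)
])
clf = GridSearchCV(estimator=pipeline, param_grid=parameters)
clf.fit(train_x, train_y)
print('GridSearch best parameters:', clf.best_params_)
print('GridSearch best score: %0.4f' % clf.best_score_)
predict_y = clf.predict(test_x)
print('Accuracy %0.4f' % accuracy_score(test_y, predict_y))

GridSearch best parameters: {'adaboostclassifier__n_estimators': 10}
GridSearch best score: 0.8187
Accuracy 0.8129

Author reply: Good job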




Geek_2a6093

2019-03-13

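This time I really got pipeline and gridsearch, which is great. Thank you, teacher! Next time, could you cover ensemble methods such as Stacking in detail, or the characteristics of the XgBoost library?

Author reply: xgboost and lightgbm are both commonly used machine learning tools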



