Python Data Analysis Example (11): Thera Bank Personal Loan Data

Seaborn histograms, box plots, count plots, heatmaps, violin plots and letter-value (boxen) plots; Matplotlib histograms and line plots. Frequently used calls:

print(df.groupby('CD Account')['Personal Loan'].agg([np.mean,np.std]))
sns.catplot(x='CD Account',hue='Personal Loan',data=df,kind='count')
1. Import modules and read the data
2. For each record with negative work experience, impute the median Experience of customers with the same Age and Education
3. Plot the distribution of each variable
4. Compute the correlations between variables (and drop the weakly related Boolean variables CreditCard, Online and Securities_Account)
5. Explore how six strong and weak variables relate to Personal Loan, and summarize key marketing strategies
6. Split the data (re-running steps 1 and 2) into training and test sets at a 70:30 ratio
7. Logistic model: training, prediction, evaluation and cross-validation
8. K-NN model: training, prediction, evaluation and cross-validation
9. Naive Bayes model: training, prediction, evaluation and cross-validation
10. Conclusion: judged by precision, K-NN appears to be the best of the three models
11. More models
12. Models after variable binning

1. Import modules and read the data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks")

import warnings
warnings.filterwarnings('ignore')  # suppress warnings

import os
os.chdir('C:/Users/Administrator/Desktop/')
df=pd.read_excel('Bank_Personal_Loan_Modelling-1.xlsx')
df1 = df.copy()

print(df.head())
df.info()
print(df.describe().transpose())
print(df.apply(lambda x: len(x.unique())))  # number of unique values per column


2. For each record with negative work experience, impute the median Experience of customers with the same Age and Education


# Check for missing values
df.isnull().values.any()

# For each record with negative Experience, impute the median Experience
# of customers with the same Age and Education
data_S = df[df['Experience'] > 0]
data_V = df['Experience'] < 0
mylist_V = df[data_V]['ID'].tolist()
for id_V in mylist_V:
    age_V = df.loc[df['ID'] == id_V, 'Age'].tolist()[0]
    education_V = df.loc[df['ID'] == id_V, 'Education'].tolist()[0]
    df_filtered_V = data_S[(data_S['Age'] == age_V) & (data_S['Education'] == education_V)]
    exp_V = df_filtered_V['Experience'].median()
    if pd.isna(exp_V):  # no matching peers: fall back to 0
        exp_V = 0
    df.loc[df['ID'] == id_V, 'Experience'] = exp_V
print(df['Experience'].describe())
sns.distplot(df['Experience'])
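
As an aside, the same imputation can be done without an explicit loop. The sketch below (an alternative, not the original code) masks the negative values and fills them with the median Experience of rows sharing the same Age and Education, falling back to 0 when a group has no valid peers:

# Sketch: vectorized equivalent of the loop above
neg = df['Experience'] < 0                              # rows to impute
peers = df['Experience'].where(df['Experience'] > 0)    # valid peer values only
group_median = peers.groupby([df['Age'], df['Education']]).transform('median')
df.loc[neg, 'Experience'] = group_median[neg].fillna(0)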

3. Plot the distribution of each variable

df.hist(figsize=(15,20), bins=50, xlabelsize=10, ylabelsize=10)
plt.savefig('variable_distributions.jpg', dpi=200)
plt.show()


fig,axes = plt.subplots(4,3,figsize=(15,20))
plt.subplots_adjust(wspace = 0.3,hspace=0.3)


sns.boxplot(x="Personal Loan", y="Age", hue="Personal Loan", data=df, palette="PRGn",ax = axes[0,0])
sns.boxplot(x="Personal Loan", y="CCAvg", hue="Personal Loan", data=df, palette="PRGn",ax = axes[0,1])
sns.boxplot(x="Personal Loan", y="CD Account", hue="Personal Loan", data=df, palette="PRGn",ax = axes[0,2])
sns.boxplot(x="Personal Loan", y="CreditCard", hue="Personal Loan", data=df, palette="PRGn",ax = axes[1,0])
sns.boxplot(x="Personal Loan", y="Education", hue="Personal Loan", data=df, palette="PRGn",ax = axes[1,1])
sns.boxplot(x="Personal Loan", y="Experience", hue="Personal Loan", data=df, palette="PRGn",ax = axes[1,2])
sns.boxplot(x="Personal Loan", y="Family", hue="Personal Loan", data=df, palette="PRGn",ax = axes[2,0])
sns.boxplot(x="Personal Loan", y="Income", hue="Personal Loan", data=df, palette="PRGn",ax = axes[2,1])
sns.boxplot(x="Personal Loan", y="Mortgage", hue="Personal Loan", data=df, palette="PRGn",ax = axes[2,2])
sns.boxplot(x="Personal Loan", y="Online", hue="Personal Loan", data=df, palette="PRGn",ax = axes[3,0])
sns.boxplot(x="Personal Loan", y="Securities Account", hue="Personal Loan", data=df, palette="PRGn",ax = axes[3,1])
sns.boxplot(x="Personal Loan", y="ZIP Code", hue="Personal Loan", data=df, palette="PRGn",ax = axes[3,2])

plt.savefig('boxplots_by_personal_loan.jpg', dpi=200)


Feature descriptions:

ID: customer account ID, no analytical meaning

Age: customer age, numeric

Experience: years of work experience, numeric

Income: annual income, numeric

ZIP_Code: customer ZIP code, should be treated as a string

Family: family size, discrete numeric

CCAvg: average monthly credit card spending, numeric

Education: education level, discrete numeric; 1 = undergraduate, 2 = graduate, 3 = advanced/professional

Mortgage: value of the house mortgage, numeric

Personal_Loan: whether the customer accepted the personal loan offered in this campaign, Boolean

Securities_Account: whether the customer has a securities account with the bank, Boolean

CD_Account: whether the customer has a certificate of deposit (CD) account with the bank, Boolean

Online: whether the customer uses online banking, Boolean

CreditCard: whether the customer has a credit card, Boolean



4. Compute the correlations between variables (and drop the weakly related Boolean variables CreditCard, Online and Securities_Account)

# Check the correlations
df.corr()

# Correlation between Personal Loan and the other variables
df.corr()["Personal Loan"]

# Personal Loan is the target variable; look at its distribution
df.groupby(['Personal Loan']).size()

f, ax = plt.subplots(figsize=(16, 12))
sns.set()
sns.heatmap(df.drop('ID', axis=1).corr(), annot=True, cmap='YlGnBu')
plt.savefig('correlation_heatmap.jpg', dpi=200)



The heatmap shows:

1. Variables strongly correlated with loan take-up: income, credit card spending, and holding a CD account with the bank;

2. Variables weakly correlated with loan take-up: education level, mortgage value, and family size;

3. ZIP code, holding a securities account, using online banking, and having a credit card matter little;

4. Age and Experience are continuous numeric variables, so they need to be binned before inspection to see whether any particular range behaves unusually (see the sketch below).
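
For point 4, a quick way to do this is to bin Age with pd.cut and look at the loan take-up rate per bin (a sketch; the number of bins is arbitrary):

# Sketch: loan take-up rate per Age bin
age_bins = pd.cut(df['Age'], 8)
print(df.groupby(age_bins)['Personal Loan'].agg(['mean', 'count']))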



Key variables to focus on:

CCAvg (0.37): average monthly credit card spending, numeric

Income (0.50): annual income, numeric

CD_Account (0.32): whether the customer has a CD account with the bank, Boolean


5. Explore how six strong and weak variables relate to Personal Loan

5.1 x = CD Account (has a CD account with the bank), y = Personal Loan (accepted the loan)

print(df.groupby('CD Account')['Personal Loan'].agg([np.mean, np.std]))
sns.catplot(x='CD Account', hue='Personal Loan', data=df, kind='count')

Conclusion: customers with a CD account at the bank are more than six times as likely to take the loan as those without one.

Strategy: getting as many customers as possible to open a CD account is one plausible way to raise the loan take-up rate.
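
The "more than six times" figure can be read directly off the group means printed above (a quick check):

# Ratio of take-up rates: CD Account holders vs. non-holders
rates = df.groupby('CD Account')['Personal Loan'].mean()
print(rates.loc[1] / rates.loc[0])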

5.2 x = Education (education level), y = Personal Loan (accepted the loan)

# Effect of education level on loan take-up
print(df.groupby('Education')['Personal Loan'].agg([np.mean, np.std]))
sns.catplot(x='Education', hue='Personal Loan', data=df, kind='count')

Conclusion: people without a college degree are clearly less willing to take the loan than those with higher education, but beyond that first degree, additional education raises the probability only slightly.

Strategy: one plausible way to raise the take-up rate is to attract more highly educated customers.

5.3 x = Family (family size), y = Personal Loan (accepted the loan)

# Effect of family size on loan take-up
print(df.groupby('Family')['Personal Loan'].agg([np.mean, np.std]))
sns.catplot(x='Family', hue='Personal Loan', data=df, kind='count')

Conclusion: families of one or two are relatively unlikely to take the loan; the rate jumps sharply at three members and then drops again at four. This is an interesting pattern that needs to be examined together with other factors, which we discuss in the next section.

Strategy: focus marketing on customers with a family size of three.


5.4 x = Income (binned), y = Personal Loan rate, line plot

print(df.groupby('Personal Loan')['Income'].agg(['count', np.mean, np.std]))
sns.catplot(x='Personal Loan', y='Income', data=df, kind='violin')
sns.catplot(x='Personal Loan', y='Income', data=df, kind='boxen')


The two plots show that high earners are clearly more willing to take the loan than low earners, which matches common sense. However, these plots cannot show the take-up rate within each income band, so some further processing is needed.

# Loan take-up rate as income rises
df['Income_Bins'] = pd.qcut(df['Income'], 20)
print(df.groupby('Income_Bins')['Personal Loan'].agg(['mean', 'count']))
df.groupby('Income_Bins')['Personal Loan'].mean().plot()


Conclusion: once annual income exceeds 82, loan take-up rises more than fivefold; above 98 it rises almost another threefold, and the higher the income, the stronger the willingness to borrow.

Strategy: focus on customers with annual income above 98; their willingness to borrow is very strong.
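
As a quick sanity check on these thresholds (a sketch; the 82 and 98 breakpoints are simply the ones read off the plot above):

# Loan take-up rate in the three income segments suggested by the plot
for low, high in [(0, 82), (82, 98), (98, df['Income'].max())]:
    seg = df[(df['Income'] > low) & (df['Income'] <= high)]
    print('Income ({}, {}]: rate = {:.3f}, n = {}'.format(
        low, high, seg['Personal Loan'].mean(), len(seg)))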



5.5 x = CCAvg (binned, average monthly credit card spending), y = Personal Loan rate, line plot

# Loan take-up rate versus credit card spending and mortgage value
sns.catplot(x='Personal Loan', y='CCAvg', data=df, kind='boxen')
df['CCAvg_Bins'] = pd.qcut(df['CCAvg'], 20)
print(df.groupby('CCAvg_Bins')['Personal Loan'].agg(['mean', 'count']))
df.groupby('CCAvg_Bins')['Personal Loan'].mean().plot()


Conclusion: when average monthly credit card spending exceeds 2.8, the loan take-up rate roughly quadruples.

Strategy: concentrate marketing on customers whose monthly credit card spending exceeds 2.8.


5.6 x = Mortgage (binned, mortgage value), y = Personal Loan rate, line plot

sns.catplot(x='Personal Loan', y='Mortgage', data=df, kind='boxen')
df['Mortgage_Bins'] = pd.cut(df.Mortgage, 10)
print(df.groupby('Mortgage_Bins')['Personal Loan'].agg(['mean', 'count']))
df.groupby('Mortgage_Bins')['Personal Loan'].mean().plot()


Conclusion: when the mortgage value exceeds 254, willingness to take the loan rises noticeably, and above 508 more than half of the customers take it.

Strategy: apply a different marketing strategy to each of these three segments.
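
A sketch of these three segments, assuming the 254 and 508 breakpoints read off the plot:

# Loan take-up rate in the three Mortgage segments
edges = [-1, 254, 508, df['Mortgage'].max()]
print(df.groupby(pd.cut(df['Mortgage'], edges))['Personal Loan'].agg(['mean', 'count']))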




Summary of key strategies:

CCAvg (0.37): average monthly credit card spending, numeric (target customers spending more than 2.8)

Income (0.50): annual income, numeric (target customers earning more than 98)

CD_Account (0.32): whether the customer has a CD account with the bank, Boolean (get as many customers as possible to open one)



6. Split the data (re-running steps 1 and 2) into training and test sets at a 70:30 ratio

### Comment: most customers do not have a personal loan; only very few do
### Split the dataset into training and test sets

X = df.drop(['Personal Loan','ID'],axis=1)
y = df['Personal Loan']
# Split the data into training and test sets at a 70:30 ratio
from sklearn.model_selection import train_test_split

#Feature Scaling
from scipy.stats import zscore
X =X.apply(zscore)

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=1)
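
Since only about 9.6% of customers accepted the loan, a stratified split keeps that class ratio identical in both sets. A sketch (the stratify argument is an addition on top of the plain split above):

# Sketch: stratified 70:30 split preserving the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)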


7. Logistic model: training, prediction, evaluation and cross-validation

### Logistic model: training and prediction
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
model = logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
predictions

### Evaluation of the Logistic model
from sklearn.metrics import classification_report,confusion_matrix
from sklearn import metrics
print(classification_report(y_test,predictions))
confusion_matrix(y_test,predictions)
sns.heatmap(confusion_matrix(y_test,predictions), annot=True, cmap='Blues',fmt='g')
print(metrics.accuracy_score(y_test,predictions))


### K-fold cross-validation
# Imports needed for k-fold cross-validation
from sklearn.model_selection import cross_val_score, cross_val_predict
# Perform 10-fold cross-validation
scores = cross_val_score(model, X_test, y_test, cv=10)
print('Cross-validated scores:', scores)
predictions = cross_val_predict(model, X_test, y_test, cv=10)
print(classification_report(y_test, predictions))
print(metrics.accuracy_score(y_test,predictions))
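
Note that the snippet above runs cross-validation on the held-out test set. The more conventional setup scores the folds on the training data and keeps the test set for the final check; a sketch:

# Sketch: 10-fold CV on the training data, leaving the test set untouched
scores_train = cross_val_score(model, X_train, y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (scores_train.mean(), scores_train.std()))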

8. K-NN model: training, prediction, evaluation and cross-validation

### Training and prediction with K-NN
#### Print the model's accuracy for k = 3, 4, 5
# Call the nearest-neighbour algorithm
from sklearn.neighbors import KNeighborsClassifier
NNH1 = KNeighborsClassifier(n_neighbors= 3 , weights = 'uniform', metric='euclidean')
NNH1.fit(X_train, y_train)

predicted_labels_NNH1 = NNH1.predict(X_test)

print(metrics.accuracy_score(y_test,predicted_labels_NNH1))

NNH2 = KNeighborsClassifier(n_neighbors= 4 , weights = 'uniform', metric='euclidean')
NNH2.fit(X_train, y_train)

predicted_labels_NNH2 = NNH2.predict(X_test)


print(metrics.accuracy_score(y_test,predicted_labels_NNH2))


NNH3 = KNeighborsClassifier(n_neighbors= 5 , weights = 'uniform', metric='euclidean')
NNH3.fit(X_train, y_train)

predicted_labels_NNH3 = NNH3.predict(X_test)

print(metrics.accuracy_score(y_test,predicted_labels_NNH3))



### Evaluation of K-NN
# Show the confusion matrix
print(metrics.confusion_matrix(y_test,predicted_labels_NNH1))

# Show precision and recall metrics
print(metrics.classification_report(y_test,predicted_labels_NNH1))

sns.heatmap(confusion_matrix(y_test,predicted_labels_NNH1), annot=True, cmap='Blues',fmt='g')


# Show the confusion matrix
print(metrics.confusion_matrix(y_test,predicted_labels_NNH2))

# Show precision and recall metrics
print(metrics.classification_report(y_test,predicted_labels_NNH2))

# Show the confusion matrix
print(metrics.confusion_matrix(y_test,predicted_labels_NNH3))

# Show precision and recall metrics
print(metrics.classification_report(y_test,predicted_labels_NNH3))




### Cross Validation
from sklearn.model_selection import GridSearchCV

k = np.arange(1,11)
knn = KNeighborsClassifier()
parameters = {'n_neighbors': k}
GS = GridSearchCV(knn,parameters,cv=10)

GS.fit(X_train,y_train)
GS.predict(X_test)

GS.best_score_

GS.best_estimator_

GS.cv_results_


# Try k = 1 through k = 10 and record the testing accuracy
k_range = range(1, 11)

scores = []
MSE = []
# "MSE" here is really the misclassification rate (1 - accuracy)
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))
    MSE.append(1 - metrics.accuracy_score(y_test, y_pred))
    print("Misclassification rate for k =", k, ":", MSE[-1])

print(scores)

print("Lowest misclassification rate:", min(MSE))



9. Naive Bayes model: training, prediction, evaluation and cross-validation

### Training and prediction with Naive Bayes
#Fit the model
from sklearn.naive_bayes import GaussianNB
NBmodel = GaussianNB()
NBmodel.fit(X_train, y_train)


#Predict
expected = y_test
predicted_NB = NBmodel.predict(X_test)


### Evaluation of Naive Bayes
# Show the confusion matrix
print(metrics.confusion_matrix(expected, predicted_NB))


# show accuracy
print(metrics.accuracy_score(expected,predicted_NB))


#Show precision and Recall metrics
print(metrics.classification_report(expected, predicted_NB))

sns.heatmap(confusion_matrix(expected,predicted_NB), annot=True, cmap='Blues',fmt='g')






10. Conclusion: judged by precision, the K-NN model appears to be the best of the three

Looking at the target column distribution, it is clear that only 9.6% of customers accepted the personal loan.

Distribution of Personal Loan:

0: 4520

1: 480

Accuracy of the Logistic model: 0.947

Accuracy of the K-NN model (k=3): 0.94

Accuracy of the Naive Bayes model: 0.87

As you can see, the Logistic and K-NN models have roughly the same accuracy, while Naive Bayes is comparatively less accurate.

Now look at the classification reports of the Logistic and K-NN models:

Precision of the Logistic model: around 98%

Precision of the K-NN model: around 99%

For the confusion-matrix heatmaps, refer to the evaluation sections of the Logistic and K-NN models.

We use precision rather than recall because only a small share of customers accepted the personal loan.

Judged by precision, the K-NN model seems to be the best.

Accuracy is the share of correctly predicted samples among all predictions; it does not distinguish between positive and negative samples.

Precision is the share of true positives among all samples predicted positive, i.e., how many of the predicted positives are genuinely positive. Precision therefore looks only at the predicted-positive subset, whereas accuracy considers all samples.

Recall (also called sensitivity or the true positive rate) is the share of true positives among all actual positives, i.e., how many of the real positives the model manages to find.

The F-score is the harmonic mean of precision and recall, so it balances the two: if either precision or recall drops, the F-score drops, and vice versa.
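
A tiny worked example tying these four definitions to sklearn's metric functions (made-up labels, purely illustrative):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 10 toy labels: 3 true positives, 1 false positive, 1 false negative, 5 true negatives
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

print('accuracy :', accuracy_score(y_true, y_pred))   # (TP+TN)/all = 8/10
print('precision:', precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4
print('recall   :', recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
print('F1       :', f1_score(y_true, y_pred))         # harmonic mean = 0.75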

11. More models

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
sns.set(style="ticks")

# Model preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Binary classifiers
from sklearn.svm import SVC, LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score


import os
os.chdir('C:/Users/Administrator/Desktop/')
data=pd.read_excel('Bank_Personal_Loan_Modelling-1.xlsx')

data.columns=['ID', 'Age', 'Experience', 'Income', 'ZIP_Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard']

data.describe().T




# Re-run the Experience cleaning from step 2 on the renamed DataFrame
data_S = data[data['Experience'] > 0]
data_V = data['Experience'] < 0
mylist_V = data[data_V]['ID'].tolist()
for id_V in mylist_V:
    age_V = data.loc[data['ID'] == id_V, 'Age'].tolist()[0]
    education_V = data.loc[data['ID'] == id_V, 'Education'].tolist()[0]
    df_filtered_V = data_S[(data_S['Age'] == age_V) & (data_S['Education'] == education_V)]
    exp_V = df_filtered_V['Experience'].median()
    if pd.isna(exp_V):  # no matching peers: fall back to the overall median
        exp_V = data['Experience'].median()
    data.loc[data['ID'] == id_V, 'Experience'] = exp_V
print(data['Experience'].describe())
sns.distplot(data['Experience'])



train = data.drop(['Personal_Loan','ID'],axis=1)
y = data['Personal_Loan']

#Feature Scaling
from scipy.stats import zscore
train =train.apply(zscore)

X_train,X_test,y_train,y_test=train_test_split(train,y,test_size=0.3, random_state=1)

# Baseline function for the imbalanced dataset
def baseline_imbalanced_dataset(X_train,y_train,X_test,y_test):
    MLA=[AdaBoostClassifier(),BaggingClassifier(),ExtraTreesClassifier(),\
    GradientBoostingClassifier(),RandomForestClassifier(),\
    GaussianProcessClassifier(),LogisticRegressionCV(),\
    PassiveAggressiveClassifier(),SGDClassifier(),\
    Perceptron(),BernoulliNB(),GaussianNB(),KNeighborsClassifier(),\
    SVC(probability=True),LinearSVC(),DecisionTreeClassifier()]
    MLA_compare=[]
    for alg in MLA:
        MLA_dict={}
        alg.random_state=0
        MLA_name=alg.__class__.__name__
        score=cross_val_score(alg,X_train,y_train,cv=5)
        score_mean=round(score.mean(),4)
        score_std=round(score.std(),4)
        alg.fit(X_train,y_train)
        try:
            auc=roc_auc_score(y_test,alg.predict_proba(X_test)[:,1])
        except AttributeError:
            # estimators without predict_proba fall back to decision_function
            auc=roc_auc_score(y_test,alg.decision_function(X_test))
        MLA_dict['Auc']=round(auc,4)
        MLA_dict['name']=MLA_name
        MLA_dict['accuracy']=score_mean
        MLA_dict['std']=score_std
        MLA_compare.append(MLA_dict)
    MLA_df=pd.DataFrame(MLA_compare)
    MLA_df.set_index('name',inplace=True)
    return MLA_df.sort_values('Auc',ascending=False)

baseline_imbalanced_dataset(X_train,y_train,X_test,y_test)

12. Models after variable binning


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import seaborn as sns
sns.set(style="ticks")

# Model preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

# Binary classifiers
from sklearn.svm import SVC, LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import VotingClassifier
from sklearn.metrics import roc_auc_score


import os
os.chdir('C:/Users/Administrator/Desktop/')
data=pd.read_excel('Bank_Personal_Loan_Modelling-1.xlsx')

data.columns=['ID', 'Age', 'Experience', 'Income', 'ZIP_Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
       'CD_Account', 'Online', 'CreditCard']

data.describe().T




# Re-run the Experience cleaning from step 2 on the renamed DataFrame
data_S = data[data['Experience'] > 0]
data_V = data['Experience'] < 0
mylist_V = data[data_V]['ID'].tolist()
for id_V in mylist_V:
    age_V = data.loc[data['ID'] == id_V, 'Age'].tolist()[0]
    education_V = data.loc[data['ID'] == id_V, 'Education'].tolist()[0]
    df_filtered_V = data_S[(data_S['Age'] == age_V) & (data_S['Education'] == education_V)]
    exp_V = df_filtered_V['Experience'].median()
    if pd.isna(exp_V):  # no matching peers: fall back to the overall median
        exp_V = data['Experience'].median()
    data.loc[data['ID'] == id_V, 'Experience'] = exp_V
print(data['Experience'].describe())
sns.distplot(data['Experience'])

data['Income_Bins']=pd.qcut(data.Income,10,labels=range(0,10)).astype(int)
data['CCAvg_Bins']=pd.qcut(data.CCAvg,10,labels=range(0,10)).astype(int)
data['Mortgage_Bins']=pd.cut(data.Mortgage,10,labels=range(0,10)).astype(int)

# Income and monthly credit card spending show clear boundaries at 110 and 3 respectively, so generate two extra indicator features
data['Income_110']=np.where(data.Income>110,1,0)
data['CCAvg_3']=np.where(data.CCAvg>3,1,0)


# Build the modeling dataset and split it into training and test sets
columns=data.columns.drop(['ID','Age','Experience','Income','ZIP_Code','CCAvg','Mortgage','Personal_Loan','Securities_Account','Online','CreditCard'])
train=data[columns]
y=data.Personal_Loan.values
X_train,X_test,y_train,y_test=train_test_split(train,y,test_size=0.3, random_state=1)

print(X_train.shape)
print(y_train.shape)



# Baseline function for the imbalanced dataset
def baseline_imbalanced_dataset(X_train,y_train,X_test,y_test):
    MLA=[AdaBoostClassifier(),BaggingClassifier(),ExtraTreesClassifier(),\
    GradientBoostingClassifier(),RandomForestClassifier(),\
    GaussianProcessClassifier(),LogisticRegressionCV(),\
    PassiveAggressiveClassifier(),SGDClassifier(),\
    Perceptron(),BernoulliNB(),GaussianNB(),KNeighborsClassifier(),\
    SVC(probability=True),LinearSVC(),DecisionTreeClassifier()]
    MLA_compare=[]
    for alg in MLA:
        MLA_dict={}
        alg.random_state=0
        MLA_name=alg.__class__.__name__
        score=cross_val_score(alg,X_train,y_train,cv=5)
        score_mean=round(score.mean(),4)
        score_std=round(score.std(),4)
        alg.fit(X_train,y_train)
        try:
            auc=roc_auc_score(y_test,alg.predict_proba(X_test)[:,1])
        except AttributeError:
            # estimators without predict_proba fall back to decision_function
            auc=roc_auc_score(y_test,alg.decision_function(X_test))
        MLA_dict['Auc']=round(auc,4)
        MLA_dict['name']=MLA_name
        MLA_dict['accuracy']=score_mean
        MLA_dict['std']=score_std
        MLA_compare.append(MLA_dict)
    MLA_df=pd.DataFrame(MLA_compare)
    MLA_df.set_index('name',inplace=True)
    return MLA_df.sort_values('Auc',ascending=False)

baseline_imbalanced_dataset(X_train,y_train,X_test,y_test)



# Define a helper class for tuning hyperparameters
class SubestimatorOfStacking():
    '''Convenience class for base estimators when using stacking model fusion.

    Parameters
    ---------------
    alg : estimator
    param_grid : grid of parameters for the estimator
    '''
    def __init__(self,alg,param_grid,random_state=0,n_jobs=None):
        self.alg=alg
        self.param_grid=param_grid
        self.random_state=random_state
        self.alg.random_state=self.random_state
        self.n_jobs=n_jobs

    def fit(self,X,y,rfecv=False):
        self.X_train=X
        self.y_train=y
        self.rfecv_=rfecv
        self.grid=GridSearchCV(self.alg,self.param_grid,cv=5)
        self.grid.n_jobs=self.n_jobs
        self.grid.fit(self.X_train,self.y_train)
        self.best_estimator_=self.grid.best_estimator_
        self.best_params_=self.grid.best_params_
        print('The best score after GridSearchCV is '+str(self.grid.best_score_)+'.')
        if self.rfecv_:
            self.rfecv=RFECV(self.best_estimator_,min_features_to_select=int(self.X_train.shape[1]/2),cv=5)
            self.rfecv.fit(self.X_train,self.y_train)
            self.best_features_=self.X_train.columns[self.rfecv.get_support()]
            self.features_=pd.Series(self.rfecv.estimator_.feature_importances_,index=self.best_features_).sort_values(ascending=False)
            print('The best score after RFECV is '+str(self.rfecv.grid_scores_.max())+'.')
            print('The number of selected features is '+str(self.rfecv.n_features_)+'.')
            print('If you want the top features, please use self.best_features_.')
        else:
            self.features_=self.X_train.columns
        self.cv_results_=pd.DataFrame(self.grid.cv_results_)
        self.cv_results_heatmap_=self.cv_results_.pivot_table(values='mean_test_score',columns=self.cv_results_.columns[4],index=self.cv_results_.columns[5])
        sns.heatmap(self.cv_results_heatmap_,annot=True)
        print('The best params is {}'.format(self.grid.best_params_))
        return self

    def stacking(self,X,y,X_test,NFolds=5,reduce_features=0,from_importancest=False,use_best_features=False):
        self.use_best_features=use_best_features
        self.from_importancest=from_importancest
        self.reduce_features=reduce_features   
        if self.rfecv_:
            self.features_=pd.Series(self.rfecv.estimator_.feature_importances_,index=self.best_features_).sort_values(ascending=self.from_importancest)
            self.stacking_features_len_=len(self.features_.index)-self.reduce_features
            self.columns_=self.features_.index[:self.stacking_features_len_]
            self.X_train=X[self.columns_].values
            self.X_test=X_test[self.columns_].values
        else:
            self.stacking_features_len_=len(self.features_)-self.reduce_features
            self.columns_=self.features_[:self.stacking_features_len_]
            self.X_train=X[self.columns_].values
            self.X_test=X_test[self.columns_].values
        self.y_train=y
        self.NFolds=NFolds
        ntrain=self.X_train.shape[0]
        ntest=self.X_test.shape[0]
        self.oof_train=np.zeros((ntrain,))
        self.oof_test=np.zeros((ntest,))
        oof_test_df=np.empty((self.NFolds,ntest))
        kf=KFold(n_splits=self.NFolds,random_state=self.random_state)

        for i,(train_index,test_index) in enumerate(kf.split(self.X_train)):
            X_tr=self.X_train[train_index]
            y_tr=self.y_train[train_index]
            X_te=self.X_train[test_index]

            self.best_estimator_.fit(X_tr,y_tr)
            y_te=self.best_estimator_.predict(X_te)
            self.oof_train[test_index]=y_te
            oof_test_df[i,:]=self.best_estimator_.predict(X_test)
        self.oof_test=oof_test_df.mean(axis=0)
        self.oof_train=self.oof_train.reshape(-1,1)
        self.oof_test=self.oof_test.reshape(-1,1)
        return self.oof_train,self.oof_test
    def fit_predict(self,X,y,test):
        self.best_estimator_.fit(X,y)
        return self.best_estimator_.predict(test)

# Parameter grids for the three models
param_grid_GradientBoostingClassifier={'learning_rate':\
[0.005,0.007,0.01,0.02,0.03],'n_estimators':[600,800,1000,1200,1400],\
'random_state':[0]}
param_grid_SVC={'C':[200,250,300,350,400],'gamma':[0.02,0.03,0.04],\
'random_state':[0]}
param_grid_GaussianProcessClassifier={'max_iter_predict':\
[50,100,200,500,1000,2000],'random_state':[0]}


# Grid-search each of the three models
gbc=SubestimatorOfStacking(GradientBoostingClassifier(),\
param_grid_GradientBoostingClassifier).fit(X_train,y_train)
svc=SubestimatorOfStacking(SVC(),param_grid_SVC).fit(X_train,y_train)
gpc=SubestimatorOfStacking(GaussianProcessClassifier(),\
param_grid_GaussianProcessClassifier).fit(X_train,y_train)




# Compare the AUC values of the three models
print('The AUC of GradientBoostingClassifier is '\
+str(roc_auc_score(y_test,gbc.best_estimator_.decision_function(X_test))))
print('The AUC of SVC is '\
+str(roc_auc_score(y_test,svc.best_estimator_.decision_function(X_test))))
print('The AUC of GaussianProcessClassifier is '\
+str(roc_auc_score(y_test,gpc.best_estimator_.predict_proba(X_test)[:,1])))



# Build the voting classifier
vc=VotingClassifier([('gbc',gbc.best_estimator_),('gpc',gpc.best_estimator_),('svc',svc.best_estimator_)],weights=[3,1,2],voting='hard')

# Build the stacking estimators
gbc_train,gbc_test=gbc.stacking(X_train,y_train,X_test,NFolds=3)
svc_train,svc_test=svc.stacking(X_train,y_train,X_test,NFolds=3)
gpc_train,gpc_test=gpc.stacking(X_train,y_train,X_test,NFolds=3)

# Combine the three sets of out-of-fold predictions into unified training and test sets. It is worth checking how correlated the base models' predictions are: the less correlated, the better, since that means each model contributes different information.
base_train=pd.DataFrame({'gbc':gbc_train.ravel(),'gpc':gpc_train.ravel(),'svc':svc_train.ravel()})
base_test=pd.DataFrame({'gbc':gbc_test.ravel(),'gpc':gpc_test.ravel(),'svc':svc_test.ravel()})

sns.heatmap(base_train.corr(),annot=True)
sns.heatmap(base_test.corr(),annot=True)



# Train a new model on the stacked dataset; the parameter grid below was narrowed in a final tuning pass
param_grid_AdaBoostClassifier={'learning_rate':[0.005,0.007,0.01,0.012,0.015],\
'n_estimators':[30,40,50,60,70],'random_state':[0]}

abc_stacking=SubestimatorOfStacking(AdaBoostClassifier(),\
param_grid_AdaBoostClassifier).fit(base_train,y_train)

# Compare the final results of GradientBoosting, SVC, GaussianProcess, Voting and Stacking.
gbc_score=cross_val_score(gbc.best_estimator_,train,y,cv=10)
svc_score=cross_val_score(svc.best_estimator_,train,y,cv=10)
gpc_score=cross_val_score(gpc.best_estimator_,train,y,cv=10)
vc_score=cross_val_score(vc,train,y,cv=10)
abc_stacking_score=cross_val_score(abc_stacking.best_estimator_,base_train,y_train,cv=10)

scores_df=pd.DataFrame({'GradientBoosting':gbc_score,'SVC':svc_score,'GaussianProcess':gpc_score,'Voting':vc_score,'Stacking':abc_stacking_score})
scores_df.plot(kind='box')

print('The test score of stacking is {0:.4f}'.format(abc_stacking.best_estimator_.score(base_test,y_test)))
print('The test score of GradientBoosting is {0:.4f}'.format(gbc.best_estimator_.score(X_test,y_test)))
print('The AUC value of stacking is {0:.4f}'.format(roc_auc_score(y_test,abc_stacking.best_estimator_.decision_function(base_test))))
print('The AUC value of GradientBoosting is {0:.4f}'.format(roc_auc_score(y_test,gbc.best_estimator_.decision_function(X_test))))

