project_name = "german-credit-risk"

from urllib.request import urlretrieve
urlretrieve('http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data', 'german.data')

('german.data', <http.client.HTTPMessage at 0x7effdfbb5550>)

import pandas as pd
german_df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data', 
                        delimiter=' ',header=None)

german_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1000 non-null   object
 1   1       1000 non-null   int64 
 2   2       1000 non-null   object
 3   3       1000 non-null   object
 4   4       1000 non-null   int64 
 5   5       1000 non-null   object
 6   6       1000 non-null   object
 7   7       1000 non-null   int64 
 8   8       1000 non-null   object
 9   9       1000 non-null   object
 10  10      1000 non-null   int64 
 11  11      1000 non-null   object
 12  12      1000 non-null   int64 
 13  13      1000 non-null   object
 14  14      1000 non-null   object
 15  15      1000 non-null   int64 
 16  16      1000 non-null   object
 17  17      1000 non-null   int64 
 18  18      1000 non-null   object
 19  19      1000 non-null   object
 20  20      1000 non-null   int64 
dtypes: int64(8), object(13)
memory usage: 164.2+ KB

urlretrieve('http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.doc', 'german.doc')
f = open('german.doc')
german_doc= f.read()
print(german_doc)

Description of the German credit dataset.

1. Title: German Credit data

2. Source Information

Professor Dr. Hans Hofmann  
Institut f"ur Statistik und "Okonometrie  
Universit"at Hamburg  
FB Wirtschaftswissenschaften  
Von-Melle-Park 5    
2000 Hamburg 13 

3. Number of Instances:  1000

Two datasets are provided.  the original dataset, in the form provided
by Prof. Hofmann, contains categorical/symbolic attributes and
is in the file "german.data".   
 
For algorithms that need numerical attributes, Strathclyde University 
produced the file "german.data-numeric".  This file has been edited 
and several indicator variables added to make it suitable for 
algorithms which cannot cope with categorical variables.   Several
attributes that are ordered categorical (such as attribute 17) have
been coded as integer.    This was the form used by StatLog.


6. Number of Attributes german: 20 (7 numerical, 13 categorical)
   Number of Attributes german.numer: 24 (24 numerical)


7.  Attribute description for german

Attribute 1:  (qualitative)
	       Status of existing checking account
               A11 :      ... <    0 DM
	       A12 : 0 <= ... <  200 DM
	       A13 :      ... >= 200 DM /
		     salary assignments for at least 1 year
               A14 : no checking account

Attribute 2:  (numerical)
	      Duration in month

Attribute 3:  (qualitative)
	      Credit history
	      A30 : no credits taken/
		    all credits paid back duly
              A31 : all credits at this bank paid back duly
	      A32 : existing credits paid back duly till now
              A33 : delay in paying off in the past
	      A34 : critical account/
		    other credits existing (not at this bank)

Attribute 4:  (qualitative)
	      Purpose
	      A40 : car (new)
	      A41 : car (used)
	      A42 : furniture/equipment
	      A43 : radio/television
	      A44 : domestic appliances
	      A45 : repairs
	      A46 : education
	      A47 : (vacation - does not exist?)
	      A48 : retraining
	      A49 : business
	      A410 : others

Attribute 5:  (numerical)
	      Credit amount

Attibute 6:  (qualitative)
	      Savings account/bonds
	      A61 :          ... <  100 DM
	      A62 :   100 <= ... <  500 DM
	      A63 :   500 <= ... < 1000 DM
	      A64 :          .. >= 1000 DM
              A65 :   unknown/ no savings account

Attribute 7:  (qualitative)
	      Present employment since
	      A71 : unemployed
	      A72 :       ... < 1 year
	      A73 : 1  <= ... < 4 years  
	      A74 : 4  <= ... < 7 years
	      A75 :       .. >= 7 years

Attribute 8:  (numerical)
	      Installment rate in percentage of disposable income

Attribute 9:  (qualitative)
	      Personal status and sex
	      A91 : male   : divorced/separated
	      A92 : female : divorced/separated/married
              A93 : male   : single
	      A94 : male   : married/widowed
	      A95 : female : single

Attribute 10: (qualitative)
	      Other debtors / guarantors
	      A101 : none
	      A102 : co-applicant
	      A103 : guarantor

Attribute 11: (numerical)
	      Present residence since

Attribute 12: (qualitative)
	      Property
	      A121 : real estate
	      A122 : if not A121 : building society savings agreement/
				   life insurance
              A123 : if not A121/A122 : car or other, not in attribute 6
	      A124 : unknown / no property

Attribute 13: (numerical)
	      Age in years

Attribute 14: (qualitative)
	      Other installment plans 
	      A141 : bank
	      A142 : stores
	      A143 : none

Attribute 15: (qualitative)
	      Housing
	      A151 : rent
	      A152 : own
	      A153 : for free

Attribute 16: (numerical)
              Number of existing credits at this bank

Attribute 17: (qualitative)
	      Job
	      A171 : unemployed/ unskilled  - non-resident
	      A172 : unskilled - resident
	      A173 : skilled employee / official
	      A174 : management/ self-employed/
		     highly qualified employee/ officer

Attribute 18: (numerical)
	      Number of people being liable to provide maintenance for

Attribute 19: (qualitative)
	      Telephone
	      A191 : none
	      A192 : yes, registered under the customers name

Attribute 20: (qualitative)
	      foreign worker
	      A201 : yes
	      A202 : no


8.  Cost Matrix

This dataset requires use of a cost matrix (see below)


      1        2
----------------------------
  1   0        1
-----------------------
  2   5        0

(1 = Good,  2 = Bad)

the rows represent the actual classification and the columns
the predicted classification.

It is worse to class a customer as good when they are bad (5), 
than it is to class a customer as bad when they are good (1).

german_df.columns=['account_bal','duration','payment_status','purpose',
                   'credit_amount','savings_bond_value','employed_since',
                   'intallment_rate','sex_marital','guarantor','residence_since',
                   'most_valuable_asset','age','concurrent_credits','type_of_housing',
                   'number_of_existcr','job','number_of_dependents','telephon',
                   'foreign','target']

german_df= german_df.replace(['A11','A12','A13','A14', 'A171','A172','A173','A174','A121','A122','A123','A124'],
                  ['neg_bal','positive_bal','positive_bal','no_acc','unskilled','unskilled','skilled','highly_skilled',
                   'none','car','life_insurance','real_estate'])

# import libraries for visualizations

import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

# check for missing values
german_df.isna().any().any()

False

german_df.target.unique()

array([1, 2])

from sklearn.preprocessing import LabelEncoder

le= LabelEncoder()
le.fit(german_df.target)
german_df.target=le.transform(german_df.target)
german_df.target.head(5)

0    0
1    1
2    0
3    0
4    1
Name: target, dtype: int64

good_bad_per=round(((german_df.target.value_counts()/german_df.target.count())*100))
good_bad_per
plt.pie(good_bad_per,labels=['Good loans', 'Bad loans'], autopct='%1.0f%%', startangle=90)
plt.title('Percentage of good and bad loans');

german_df[['credit_amount','duration','age']].describe()

german_df['credit_amount']=np.log(german_df['credit_amount'])

german_df[['credit_amount','duration','age']].describe()

# histograms of continues variables

fig, axes = plt.subplots(1,3, figsize=(16,8))
plt.suptitle('Histogram of continuous variables')
axes[0].hist(german_df['duration'])
axes[0].set_xlabel('No. of observations')
axes[0].set_ylabel('Years')
axes[0].set_title('Histogram of loan duration');

axes[1].hist(german_df['credit_amount'])
axes[1].set_xlabel('No. of observations')
axes[1].set_ylabel('Credit amount (dollars)')
axes[1].set_title('Histogram of Credit amount');

axes[2].hist(german_df['age'])
axes[2].set_xlabel('No. of observations')
axes[2].set_ylabel('Age')
axes[2].set_title('Histogram of Age');

# box-plots of continues variables

fig, ax = plt.subplots(1,3,figsize=(20,5))
plt.suptitle('BOX PLOTS')
sns.boxplot(german_df['credit_amount'], ax=ax[0]);
sns.boxplot(german_df['duration'], ax=ax[1], color='salmon');
sns.boxplot(german_df['age'], ax=ax[2], color='darkviolet');

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  FutureWarning

sns.scatterplot(y=german_df.credit_amount, 
                x=german_df.duration, 
                hue=german_df.target, 
                s=100, 
                );

german_df.groupby('job')['target'].value_counts().unstack(level=1).plot.barh(stacked=True, figsize=(10, 6))

<matplotlib.axes._subplots.AxesSubplot at 0x7effccf2f210>

sns.lineplot(data=german_df, x='duration', y='credit_amount', hue='target', palette='deep');

german_df.groupby('most_valuable_asset')['target'].value_counts().unstack(level=1).plot.barh(stacked=True, figsize=(10, 6))

<matplotlib.axes._subplots.AxesSubplot at 0x7effcce61d90>

sns.scatterplot(y=german_df.credit_amount, 
                x=german_df.most_valuable_asset, 
                hue=german_df.target, 
                s=100, 
                );

# Number of unique classes in each object column
german_df.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

account_bal             3
payment_status          5
purpose                10
savings_bond_value      5
employed_since          5
sex_marital             4
guarantor               3
most_valuable_asset     4
concurrent_credits      3
type_of_housing         3
job                     3
telephon                2
foreign                 2
dtype: int64

# sklearn preprocessing for dealing with categorical variables
from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in german_df:
    if german_df[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(german_df[col].unique())) <= 2:
            # Train on the training data
            le.fit(german_df[col])
            # Transform both training and testing data
            german_df[col] = le.transform(german_df[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

2 columns were label encoded.

# one-hot encoding of categorical variables
german_df = pd.get_dummies(german_df)

print('Encoded Features shape: ', german_df.shape)

Encoded Features shape:  (1000, 58)

# Find correlations with the target and sort
correlations = german_df.corr()['target'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

Most Positive Correlations:
 sex_marital_A92                    0.075493
type_of_housing_A153               0.081556
account_bal_positive_bal           0.089895
type_of_housing_A151               0.092785
concurrent_credits_A141            0.096510
purpose_A40                        0.096900
employed_since_A72                 0.106397
credit_amount                      0.109570
most_valuable_asset_real_estate    0.125750
payment_status_A31                 0.134448
payment_status_A30                 0.144767
savings_bond_value_A61             0.161007
duration                           0.214927
account_bal_neg_bal                0.258333
target                             1.000000
Name: target, dtype: float64

Most Negative Correlations:
 account_bal_no_acc         -0.322436
payment_status_A34         -0.181713
type_of_housing_A152       -0.134589
savings_bond_value_A65     -0.129238
most_valuable_asset_none   -0.119300
concurrent_credits_A143    -0.113285
purpose_A43                -0.106922
purpose_A41                -0.099791
age                        -0.091127
savings_bond_value_A64     -0.085749
foreign                    -0.082079
sex_marital_A93            -0.080677
employed_since_A74         -0.075980
savings_bond_value_A63     -0.070954
employed_since_A75         -0.059733
Name: target, dtype: float64

# Extract the significantly correlated variables
corr_data = german_df[['target', 'account_bal_neg_bal','duration','account_bal_no_acc']]
corr_data_corrs = corr_data.corr()


# Heatmap of correlations
sns.heatmap(corr_data_corrs, cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation Heatmap');

# Make a new dataframe for polynomial features
poly_features = german_df[['duration','account_bal_neg_bal','account_bal_no_acc']]
poly_target=german_df['target']

from sklearn.preprocessing import PolynomialFeatures
                                  
# Create the polynomial object with specified degree
poly_transformer = PolynomialFeatures(degree = 2)
# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
print('Polynomial Features shape: ', poly_features.shape)

Polynomial Features shape:  (1000, 10)

poly_transformer.get_feature_names(input_features = ['duration','account_bal_neg_bal','account_bal_no_acc'])

['1',
 'duration',
 'account_bal_neg_bal',
 'account_bal_no_acc',
 'duration^2',
 'duration account_bal_neg_bal',
 'duration account_bal_no_acc',
 'account_bal_neg_bal^2',
 'account_bal_neg_bal account_bal_no_acc',
 'account_bal_no_acc^2']

# Create a dataframe for polynomial features 
poly_features = pd.DataFrame(
    poly_features, columns = poly_transformer.get_feature_names(
        ['duration','account_bal_neg_bal','account_bal_no_acc']))

# Add in the target
poly_features['target'] = poly_target

# Find the correlations with the target
poly_corrs = poly_features.corr()['target'].sort_values()

# Display the correlations
poly_corrs

account_bal_no_acc                       -0.322436
account_bal_no_acc^2                     -0.322436
duration account_bal_no_acc              -0.232697
duration^2                                0.200996
duration                                  0.214927
account_bal_neg_bal                       0.258333
account_bal_neg_bal^2                     0.258333
duration account_bal_neg_bal              0.303343
target                                    1.000000
1                                              NaN
account_bal_neg_bal account_bal_no_acc         NaN
Name: target, dtype: float64

list(poly_features)

['1',
 'duration',
 'account_bal_neg_bal',
 'account_bal_no_acc',
 'duration^2',
 'duration account_bal_neg_bal',
 'duration account_bal_no_acc',
 'account_bal_neg_bal^2',
 'account_bal_neg_bal account_bal_no_acc',
 'account_bal_no_acc^2',
 'target']

# deleting duplicate columns in poly_features

for i in list(poly_features.columns):
  for j in list(german_df.columns):
    if (i==j):
      poly_features.drop(labels=i, axis=1, inplace=True)

poly_features.drop(labels='1', axis=1, inplace=True)
list(poly_features)

['duration^2',
 'duration account_bal_neg_bal',
 'duration account_bal_no_acc',
 'account_bal_neg_bal^2',
 'account_bal_neg_bal account_bal_no_acc',
 'account_bal_no_acc^2']

# Print shape of original german_df
print('Original features shape: ', german_df.shape)

# Merge polnomial features into the dataframe
german_df_poly = german_df.merge(poly_features, left_index=True, right_index=True, how = 'left')

# Print out the new shapes
print('Merged polynomial features shape: ', german_df_poly.shape)

Original features shape:  (1000, 58)
Merged polynomial features shape:  (1000, 64)

german_df_poly.isna().any().any()

False

list(german_df_poly)

['duration',
 'credit_amount',
 'intallment_rate',
 'residence_since',
 'age',
 'number_of_existcr',
 'number_of_dependents',
 'telephon',
 'foreign',
 'target',
 'account_bal_neg_bal',
 'account_bal_no_acc',
 'account_bal_positive_bal',
 'payment_status_A30',
 'payment_status_A31',
 'payment_status_A32',
 'payment_status_A33',
 'payment_status_A34',
 'purpose_A40',
 'purpose_A41',
 'purpose_A410',
 'purpose_A42',
 'purpose_A43',
 'purpose_A44',
 'purpose_A45',
 'purpose_A46',
 'purpose_A48',
 'purpose_A49',
 'savings_bond_value_A61',
 'savings_bond_value_A62',
 'savings_bond_value_A63',
 'savings_bond_value_A64',
 'savings_bond_value_A65',
 'employed_since_A71',
 'employed_since_A72',
 'employed_since_A73',
 'employed_since_A74',
 'employed_since_A75',
 'sex_marital_A91',
 'sex_marital_A92',
 'sex_marital_A93',
 'sex_marital_A94',
 'guarantor_A101',
 'guarantor_A102',
 'guarantor_A103',
 'most_valuable_asset_car',
 'most_valuable_asset_life_insurance',
 'most_valuable_asset_none',
 'most_valuable_asset_real_estate',
 'concurrent_credits_A141',
 'concurrent_credits_A142',
 'concurrent_credits_A143',
 'type_of_housing_A151',
 'type_of_housing_A152',
 'type_of_housing_A153',
 'job_highly_skilled',
 'job_skilled',
 'job_unskilled',
 'duration^2',
 'duration account_bal_neg_bal',
 'duration account_bal_no_acc',
 'account_bal_neg_bal^2',
 'account_bal_neg_bal account_bal_no_acc',
 'account_bal_no_acc^2']

from sklearn.model_selection import train_test_split

x, y = german_df.drop('target', axis=1), german_df['target']
x.shape, y.shape

((1000, 57), (1000,))

x_train, x_test, y_train, y_test= train_test_split(x,y, test_size=.2, random_state=42, stratify=y)

x_train.shape, x_test.shape

((800, 57), (200, 57))

x_train

y_train

828    1
997    0
148    0
735    0
130    0
      ..
492    0
545    1
298    0
417    0
749    0
Name: target, Length: 800, dtype: int64

# train and test split on german_df_poly
X, Y = german_df_poly.drop('target', axis=1), german_df_poly['target']
print(X.shape, Y.shape)

X_train, X_test, Y_train, Y_test= train_test_split(X,Y, test_size=.2, random_state=42, stratify=Y)

(1000, 63) (1000,)

y.value_counts(normalize=True)

0    0.7
1    0.3
Name: target, dtype: float64

# import packages, functions, and classes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import roc_auc_score, recall_score, classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, cross_validate

# prepare models
models = []
models.append(('DT', DecisionTreeClassifier(random_state=42)))
models.append(('LR', LogisticRegression(random_state=42)))
models.append(('RF', RandomForestClassifier(random_state=42)))
models.append(('NB', GaussianNB())) 
models.append(('XGB', XGBClassifier(random_state=42)))
models.append(('KNN', KNeighborsClassifier())) 
models.append(('SVM', SVC(gamma='auto',random_state=42)))

# evaluate each model in turn
results_recall = []
results_roc_auc= []
names = []
# recall= tp/ (tp+fn). Best value=1, worst value=0
scoring = ['recall', 'roc_auc']

for name, model in models:
        # split dataset into k folds. use one fold for validation and remaining k-1 folds for training
        skf= StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
        # Evaluate a score by cross-validation. Returns array of scores of the model for each run of the cross validation.
        #cv_results = cross_val_score(model, x_train, y_train, cv=skf, scoring=scoring)
        cv_results = cross_validate(model, x_train, y_train, cv=skf, scoring=scoring)
        results_recall.append(cv_results['test_recall'])
        results_roc_auc.append(cv_results['test_roc_auc'])
        names.append(name)
        msg = "%s- recall:%f roc_auc:%f" % (name, cv_results['test_recall'].mean(),cv_results['test_roc_auc'].mean())
        print(msg)
        
# boxplot algorithm comparison
#recall score plots
fig = plt.figure(figsize=(11,6))
fig.suptitle('Recall scoring Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_recall, showmeans=True)
ax.set_xticklabels(names)
plt.show();
# roc auc score box plots
fig = plt.figure(figsize=(11,6))
fig.suptitle('AUC scoring Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results_roc_auc, showmeans=True)
ax.set_xticklabels(names)
plt.show();

from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.model_selection import GridSearchCV

# Logistic model on data=german_df
log_reg = LogisticRegression(C = 0.0001, random_state=42)
log_reg.fit(x_train, y_train)
print('Dataset: german_df\n Recall score on train set: {:.2f}'.format(cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='recall').mean()))

# Logistic model on data=german_df_poly
log_reg_poly = LogisticRegression(C = 0.0001, random_state=42, solver='lbfgs', max_iter=1000)
log_reg_poly.fit(X_train, Y_train)
print('Dataset: german_df_poly\n Recall score on train set: {:.2f}'.format(cross_val_score(log_reg_poly, X_train, Y_train, cv=skf, scoring='recall').mean()))

'''
# Evaluate on test dataset
recall_test= recall_score(y_test,log_reg.predict(x_test))
roc_test=roc_auc_score(y_test,log_reg.predict_proba(x_test)[:, 1])
print('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2))
tuned_models_test.append(('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2)))

# Evaluate on train dataset
roc_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='roc_auc').mean()
recall_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='recall').mean()
print('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2))
tuned_models_train.append(('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2)))
print(classification_report(y_test, log_reg.predict(x_test)))
'''

Dataset: german_df
 Recall score on train set: 0.01
Dataset: german_df_poly
 Recall score on train set: 0.18

"\n# Evaluate on test dataset\nrecall_test= recall_score(y_test,log_reg.predict(x_test))\nroc_test=roc_auc_score(y_test,log_reg.predict_proba(x_test)[:, 1])\nprint('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2))\ntuned_models_test.append(('LR',' recall_test:', round(recall_test,2),' auc_roc_test:', round(roc_test,2)))\n\n# Evaluate on train dataset\nroc_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='roc_auc').mean()\nrecall_train= cross_val_score(log_reg, x_train, y_train, cv=skf, scoring='recall').mean()\nprint('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2))\ntuned_models_train.append(('LR',' recall_train:', round(recall_train,2),' auc_roc_train:', round(roc_train,2)))\nprint(classification_report(y_test, log_reg.predict(x_test)))\n"

C=[1,0.1,0.01,0.001,0.0001]

# Create lists to save the values of recall score on training and test sets
train_recall = []
test_recall = []

for c in C:
  log_reg_poly = LogisticRegression(C = c, random_state=42, solver='lbfgs', max_iter=5000).fit(X_train, Y_train)
  train_recall.append(cross_val_score(log_reg_poly, X_test, Y_test, cv=skf,scoring='recall').mean())
  test_recall.append(recall_score(Y_test, log_reg_poly.predict(X_test)))

#print("Best recall score on test dataset is {:.2f}% with C={} ".format(max(test_recall)*100, C[np.argmax(test_recall)]))

# plot graph of recall score vs. C
plt.plot(C, train_recall, label='train')
plt.plot(C, test_recall, label='test')
plt.legend()
plt.xlabel('C')
plt.ylabel('Recall score')
plt.title('Logistic regression: recall score vs C');

test_recall

[0.4166666666666667,
 0.38333333333333336,
 0.4166666666666667,
 0.31666666666666665,
 0.26666666666666666]

# Final logistc model
log_reg_poly = LogisticRegression(C = 0.01, random_state=42, solver='lbfgs', max_iter=5000).fit(X_train, Y_train)
print('Recall score on test dataset: {:.2f}'.format(recall_score(Y_test, log_reg_poly.predict(X_test))))
print('\nAUC ROC score on test dataset: {:.2f}'.format(roc_auc_score(Y_test,log_reg_poly.predict_proba(X_test)[:, 1])))

Recall score on test dataset: 0.42

AUC ROC score on test dataset: 0.78

# Create lists to save the values of accuracy on training and test sets
train_acc = []
test_acc = []
trees_grid = [5, 10, 15, 20, 30, 50, 75, 100]

for ntrees in trees_grid:
    rf = RandomForestClassifier(n_estimators=ntrees, random_state=42, n_jobs=-1).fit(X_train, Y_train)
    train_acc.append(cross_val_score(rf, X_test, Y_test, cv=skf, scoring='recall').mean())
    test_acc.append(recall_score(rf.predict(X_test), Y_test))
print("Best recall score on test dataset is {:.2f}% with {} trees".format(max(test_acc)*100, 
                                                         trees_grid[np.argmax(test_acc)]))
# plot
plt.plot(trees_grid, train_acc, label='train')
plt.plot(trees_grid, test_acc, label='test')
plt.legend()
plt.xlabel('No. of trees (n_estimators)')
plt.ylabel('Accuracies (Recall score)')
plt.title('Random-Forest: accuracy vs n_estimators');

Best recall score on test dataset is 68.29% with 20 trees

# Initialize the set of parameters for exhaustive search and fit 
parameters = {'max_features': [7, 10, 16, 18], 
              'min_samples_leaf': [1, 3, 5, 7], 
              'max_depth': [15, 20, 24, 27]}
rf = RandomForestClassifier(n_estimators=20, random_state=42, n_jobs=-1)
gcv = GridSearchCV(rf, parameters, n_jobs=-1, cv=skf, verbose=1, scoring='recall')
gcv_fit= gcv.fit(X_train, Y_train)

Fitting 10 folds for each of 64 candidates, totalling 640 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   22.1s
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:   49.8s
[Parallel(n_jobs=-1)]: Done 640 out of 640 | elapsed:  1.2min finished

# evaluate on train and test datasets
cross_val= cross_val_score(gcv_fit.best_estimator_, X_train, Y_train, cv=skf,scoring='recall').mean()
recall_sc=recall_score(Y_test,gcv_fit.best_estimator_.predict(X_test))
roc_sc= roc_auc_score(Y_test,gcv_fit.best_estimator_.predict_proba(X_test)[:, 1])
# print scores
print("GCV best parameters: ",gcv.best_params_)
print("GCV best score: {:.2f}".format(gcv.best_score_)) 
print("Cross val score on train dataset: {:.2f}".format(cross_val))
print("Recall score on test dataset: {:.2f}".format(recall_sc))
print("ROC AUC score on test dataset: {:.2f}".format(roc_sc))

GCV best parameters:  {'max_depth': 20, 'max_features': 18, 'min_samples_leaf': 3}
GCV best score: 0.48
Cross val score on train dataset: 0.48
Recall score on test dataset: 0.40
ROC AUC score on test dataset: 0.74

# model
gnb= GaussianNB()
gnb.fit(X_train, Y_train)
#evaluate
print('Train accuracy: {:.2f}'.format(cross_val_score(gnb, X_train, Y_train, cv=skf, scoring='recall').mean()))
print('Recall score test dataset: {:.2f}'.format(recall_score(Y_test, gnb.predict(X_test))))
print('ROC AUC score test dataset: {:.2f}'.format(roc_auc_score(Y_test, gnb.predict_proba(X_test)[:,1])))

Train accuracy: 0.67
Recall score test dataset: 0.67
ROC AUC score test dataset: 0.78

# Create lists to save the values of accuracy on training and test sets
train_acc = []
test_acc = []
estimator_grid = [5, 10, 15, 20, 30, 50, 75, 100]

for nestimator in estimator_grid:
    xg = XGBClassifier(n_estimators=nestimator, random_state=42, n_jobs=-1).fit(X_train, Y_train)
    train_acc.append(cross_val_score(xg, X_test, Y_test, cv=skf, scoring='recall').mean())
    test_acc.append(recall_score(xg.predict(X_test), Y_test))
print("Best recall score on test dataset is {:.2f}% with {} estimators".format(max(test_acc)*100, 
                                                         estimator_grid[np.argmax(test_acc)]))
# plot
plt.plot(estimator_grid, train_acc, label='train')
plt.plot(estimator_grid, test_acc, label='test')
plt.legend()
plt.xlabel('No. of trees (n_estimators)')
plt.ylabel('Accuracies (Recall score)')
plt.title('XG-Boost: accuracy vs n_estimators');

Best recall score on test dataset is 66.67% with 10 estimators

#model
model_xg = XGBClassifier(random_state=42)
model_xg.fit(X_train, Y_train)
#evaluate
print('Train accuracy:', cross_val_score(model_xg, X_train, Y_train, cv=skf, scoring='recall').mean())
print('Recall score on test dataset:', recall_score(Y_test, model_xg.predict(X_test)))
print('ROC AUC score on test dataset:', roc_auc_score(Y_test, model_xg.predict_proba(X_test)[:,1]))

Train accuracy: 0.45
Recall score on test dataset: 0.5
ROC AUC score on test dataset: 0.7957142857142858

# Initialize the set of parameters for exhaustive search and fit 
parameters = {'max_features': [7, 10, 16, 18], 
              'min_samples_leaf': [1, 3, 5, 7], 
              'max_depth': [15, 20, 24, 27]}
model_xg = XGBClassifier(n_estimators=10, random_state=42, n_jobs=-1)
gcv = GridSearchCV(model_xg, parameters, n_jobs=-1, cv=skf, verbose=1, scoring='recall')
gcv_fit= gcv.fit(X_train, Y_train)

Fitting 10 folds for each of 64 candidates, totalling 640 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  88 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 388 tasks      | elapsed:   20.1s
[Parallel(n_jobs=-1)]: Done 640 out of 640 | elapsed:   33.0s finished

gcv.best_params_, gcv.best_score_

({'max_depth': 15, 'max_features': 7, 'min_samples_leaf': 1}, 0.4625)

# evaluate on train and test datasets
cross_val= cross_val_score(gcv_fit.best_estimator_, X_train, Y_train, cv=skf,scoring='recall').mean()
recall_sc=recall_score(Y_test,gcv_fit.best_estimator_.predict(X_test))
roc_sc= roc_auc_score(Y_test,gcv_fit.best_estimator_.predict_proba(X_test)[:, 1])
# print scores
print("GCV best parameters: ",gcv.best_params_)
print("GCV best score: {:.2f}".format(gcv.best_score_)) 
print("Cross val score on train dataset: {:.2f}".format(cross_val))
print("Recall score on test dataset: {:.2f}".format(recall_sc))
print("ROC AUC score on test dataset: {:.2f}".format(roc_sc))

GCV best parameters:  {'max_depth': 15, 'max_features': 7, 'min_samples_leaf': 1}
GCV best score: 0.46
Cross val score on train dataset: 0.46
Recall score on test dataset: 0.43
ROC AUC score on test dataset: 0.73

from sklearn.neighbors import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors = 1)
knn.fit(X_train, Y_train)
#evaluate
print('Train accuracy:{:.2f}'.format(cross_val_score(knn, X_train, Y_train, cv=skf, scoring='recall').mean()))
print('Recall score on test dataset:{:.2f}'.format(recall_score(Y_test, knn.predict(X_test))))
print("ROC AUC score on test dataset: {:.2f}".format(roc_auc_score(Y_test,gcv_fit.best_estimator_.predict_proba(X_test)[:, 1])))

Train accuracy:0.49
Recall score on test dataset:0.35
ROC AUC score on test dataset: 0.73

import jovian

jovian.commit()

import jovian

jovian.commit()

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ai/aswiniabraham/german-credit-risk-blog

'https://jovian.ai/aswiniabraham/german-credit-risk-blog'

	credit_amount	duration	age
count	1000.000000	1000.000000	1000.000000
mean	3271.258000	20.903000	35.546000
std	2822.736876	12.058814	11.375469
min	250.000000	4.000000	19.000000
25%	1365.500000	12.000000	27.000000
50%	2319.500000	18.000000	33.000000
75%	3972.250000	24.000000	42.000000
max	18424.000000	72.000000	75.000000

	credit_amount	duration	age
count	1000.000000	1000.000000	1000.000000
mean	7.788691	20.903000	35.546000
std	0.776474	12.058814	11.375469
min	5.521461	4.000000	19.000000
25%	7.219276	12.000000	27.000000
50%	7.749107	18.000000	33.000000
75%	8.287088	24.000000	42.000000
max	9.821409	72.000000	75.000000

German Credit Data Analysis

1. Import data¶

2. Data Preparation and Cleaning¶

3. Exploratory Data Analysis and Visualization¶

4. Encode categorical variables¶

5. Feature engineering¶

Polynomial Features¶

6. Data split to train and test datasets¶

7. Models¶

7.2 Logistic regression¶

7.3 Random Forest¶

7.4 GaussianNB¶

7.5 XGBoost¶

7.6 KNN¶

Summary¶

References and Future Work¶

	duration	credit_amount	intallment_rate	residence_since	age	number_of_existcr	number_of_dependents	telephon	foreign	account_bal_neg_bal	account_bal_no_acc	account_bal_positive_bal	payment_status_A30	payment_status_A31	payment_status_A32	payment_status_A33	payment_status_A34	purpose_A40	purpose_A41	purpose_A410	purpose_A42	purpose_A43	purpose_A44	purpose_A45	purpose_A46	purpose_A48	purpose_A49	savings_bond_value_A61	savings_bond_value_A62	savings_bond_value_A63	savings_bond_value_A64	savings_bond_value_A65	employed_since_A71	employed_since_A72	employed_since_A73	employed_since_A74	employed_since_A75	sex_marital_A91	sex_marital_A92	sex_marital_A93	sex_marital_A94	guarantor_A101	guarantor_A102	guarantor_A103	most_valuable_asset_car	most_valuable_asset_life_insurance	most_valuable_asset_none	most_valuable_asset_real_estate	concurrent_credits_A141	concurrent_credits_A142	concurrent_credits_A143	type_of_housing_A151	type_of_housing_A152	type_of_housing_A153	job_highly_skilled	job_skilled	job_unskilled
828	36	9.028219	3	4	47	1	1	0	0	1	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	1	0	1	0	0	0	0	0	1	0	0	1	0	0	1	0	1	0
997	12	6.689599	4	4	38	1	1	0	0	0	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	1	0	1	0	0	0	1	0	0	0	0	1	0	1	0	0	1	0
148	36	8.588769	3	2	28	2	1	0	0	1	0	0	0	0	0	0	1	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	1	1	0	0	0	0	0	1	0	1	0	0	1	0
735	36	8.291547	3	2	29	1	1	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	1	0	0	0	0	1	0	0	1	0	0	0	0	0	1	1	0	0	0	1	0	0	0	1
130	48	9.046291	1	2	24	1	1	0	0	0	0	1	0	0	1	0	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	1	0	0	1	0	0	1	0	0	0	1	0	0	0	0	1	0	1	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
492	6	7.120444	1	1	27	2	1	0	0	0	1	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	1	0	0	0	1	0	0	1	0	0	1	0	0	0	0	0	1	0	1	0	0	1	0
545	24	7.195187	4	2	43	2	2	0	0	1	0	0	0	0	0	1	0	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	1	0	1	0	0	0	0	1	0	0	0	1	0	0	1	0	1	0
298	18	7.830028	3	4	43	1	1	1	0	0	1	0	0	0	1	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	0	0	1	0	1	0	0	0	0	1	0	0	0	1	0	1	0	0	1	0
417	18	9.044404	1	2	23	2	1	1	0	1	0	0	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	1	0	0	1	0	0	0	1	0	0	1	0	0	0	1	0	0	0	0	1	1	0	0	0	1	0
749	15	8.015988	2	2	33	1	1	0	0	0	1	0	0	0	1	0	0	0	1	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	1	0	0	0	1	0	1	0	0	0	1	0	0	0	0	1	0	1	0	0	1	0