Machine Learning with Python Peer-graded Assignment Solutions

Get All Machine Learning with Python Peer-graded Assignment Solutions

This course dives into the basics of machine learning using an approachable, and well-known programming language, Python.

In this course, we will be reviewing two main components: First, you will be learning about the purpose of Machine Learning and where it applies to the real world. Second, you will get a general overview of Machine Learning topics such as supervised vs unsupervised learning, model evaluation, and Machine Learning algorithms.

Enroll on Coursera

Peer-graded Assignment: The best classifier Instruction

Now that you have been equipped with the skills to use different Machine Learning algorithms, over the course of five weeks, you will have the opportunity to practice and apply it on a dataset. In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.

You load a historical dataset from previous loan applications, clean the data, and apply different classification algorithm on the data. You are expected to use the following algorithms to build your models:

  • k-Nearest Neighbour
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

The results is reported as the accuracy of each classifier, using the following metrics when these are applicable:

  • Jaccard index
  • F1-score
  • LogLoass

Peer-graded Assignment: The best classifier Solution

Classification with Python

In this notebook we try to practice all the classification algorithms that we learned in this course.

We load a dataset using Pandas library, and apply the following algorithms, and find the best one for this specific dataset by accuracy evaluation methods.

Lets first load required libraries:

In [1]:

import itertools
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import pandas as pd
import numpy as np
import matplotlib.ticker as ticker
from sklearn import preprocessing
%matplotlib inline
About dataset

This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields:

FieldDescription
Loan_statusWhether a loan is paid off on in collection
PrincipalBasic principal loan amount at the
TermsOrigination terms which can be weekly (7 days), biweekly, and monthly payoff schedule
Effective_dateWhen the loan got originated and took effects
Due_dateSince it’s one-time payoff schedule, each loan has one single due date
AgeAge of applicant
EducationEducation of applicant
GenderThe gender of applicant

Lets download the dataset

In [2]:

!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv
--2019-05-26 01:24:50--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23101 (23K) [text/csv]
Saving to: ‘loan_train.csv’

100%[======================================>] 23,101      --.-K/s   in 0.002s  

2019-05-26 01:24:50 (12.7 MB/s) - ‘loan_train.csv’ saved [23101/23101]

Load Data From CSV File

In [3]:

df = pd.read_csv('loan_train.csv')
df.head(10)

Out[3]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGender
000PAIDOFF1000309/8/201610/7/201645High School or Belowmale
122PAIDOFF1000309/8/201610/7/201633Bechalorfemale
233PAIDOFF1000159/8/20169/22/201627collegemale
344PAIDOFF1000309/9/201610/8/201628collegefemale
466PAIDOFF1000309/9/201610/8/201629collegemale
577PAIDOFF1000309/9/201610/8/201636collegemale
688PAIDOFF1000309/9/201610/8/201628collegemale
799PAIDOFF800159/10/20169/24/201626collegemale
81010PAIDOFF30079/10/20169/16/201629collegemale
91111PAIDOFF1000159/10/201610/9/201639High School or Belowmale

In [4]:

df.shape

Out[4]:

(346, 10)
Convert to date time object

In [5]:

df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()

Out[5]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGender
000PAIDOFF1000302016-09-082016-10-0745High School or Belowmale
122PAIDOFF1000302016-09-082016-10-0733Bechalorfemale
233PAIDOFF1000152016-09-082016-09-2227collegemale
344PAIDOFF1000302016-09-092016-10-0828collegefemale
466PAIDOFF1000302016-09-092016-10-0829collegemale
Data visualization and pre-processing

Let’s see how many of each class is in our data set

In [6]:

df['loan_status'].value_counts()

Out[6]:

PAIDOFF       260
COLLECTION     86
Name: loan_status, dtype: int64

260 people have paid off the loan on time while 86 have gone into collection

Lets plot some columns to underestand data better:

In [7]:

# notice: installing seaborn might takes a few minutes
# !conda install -c anaconda seaborn -y

In [8]:

import seaborn as sns

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()

In [9]:

bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")

g.axes[-1].legend()
plt.show()
Pre-processing: Feature selection/extraction
Lets look at the day of the week people get the loan

In [10]:

df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()

We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4

In [11]:

df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
df.head()

Out[11]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGenderdayofweekweekend
000PAIDOFF1000302016-09-082016-10-0745High School or Belowmale30
122PAIDOFF1000302016-09-082016-10-0733Bechalorfemale30
233PAIDOFF1000152016-09-082016-09-2227collegemale30
344PAIDOFF1000302016-09-092016-10-0828collegefemale41
466PAIDOFF1000302016-09-092016-10-0829collegemale41
Convert Categorical features to numerical values

Lets look at gender:

In [12]:

df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)

Out[12]:

Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64

86 % of female pay there loans while only 73 % of males pay there loan

Lets convert male to 0 and female to 1:

In [13]:

df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
df.head()

Out[13]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGenderdayofweekweekend
000PAIDOFF1000302016-09-082016-10-0745High School or Below030
122PAIDOFF1000302016-09-082016-10-0733Bechalor130
233PAIDOFF1000152016-09-082016-09-2227college030
344PAIDOFF1000302016-09-092016-10-0828college141
466PAIDOFF1000302016-09-092016-10-0829college041
One Hot Encoding
How about education?

In [14]:

df.groupby(['education'])['loan_status'].value_counts(normalize=True)

Out[14]:

education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64
Feature befor One Hot Encoding

In [15]:

df[['Principal','terms','age','Gender','education']].head()

Out[15]:

PrincipaltermsageGendereducation
0100030450High School or Below
1100030331Bechalor
2100015270college
3100030281college
4100030290college
Use one hot encoding technique to conver categorical varables to binary variables and append them to the feature Data Frame

In [16]:

Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()

Out[16]:

PrincipaltermsageGenderweekendBechalorHigh School or Belowcollege
01000304500010
11000303310100
21000152700001
31000302811001
41000302901001

Feature selection

Lets defind feature sets, X:

In [17]:

X = Feature
X[0:5]

Out[17]:

PrincipaltermsageGenderweekendBechalorHigh School or Belowcollege
01000304500010
11000303310100
21000152700001
31000302811001
41000302901001

What are our lables?

Replacing “PAIDOFF” and “COLLECTION” strings with 0 and 1 integers because some models can only work with integers or floats

In [18]:

df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1],inplace=True)
df.head()

Out[18]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGenderdayofweekweekend
00001000302016-09-082016-10-0745High School or Below030
12201000302016-09-082016-10-0733Bechalor130
23301000152016-09-082016-09-2227college030
34401000302016-09-092016-10-0828college141
46601000302016-09-092016-10-0829college041

In [19]:

y = df['loan_status'].values
y[0:20]

Out[19]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Doing Train test split –

In [20]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.40, random_state=5)

In [21]:

print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
Train set: (207, 8) (207,)
Test set: (139, 8) (139,)

In [22]:

display(X_train.shape, y_train.shape)
display(X_test.shape, y_test.shape)
(207, 8)
(207,)
(139, 8)
(139,)
Normalize Data

Data Standardization give data zero mean and unit variance (technically should be done after train test split )

In [23]:

# (technically should be done after train test split )
X_train = preprocessing.StandardScaler().fit(X_train).transform(X_train)
X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
X_train[0:5]

Out[23]:

array([[ 0.52533066, -0.97207061,  0.53495582, -0.45109685,  0.85993942,
         2.76134025, -0.86846836, -0.91206272],
       [ 0.52533066,  0.91317564, -0.47326137,  2.21681883,  0.85993942,
        -0.36214298, -0.86846836,  1.09641582],
       [ 0.52533066,  0.91317564, -0.13718897, -0.45109685,  0.85993942,
        -0.36214298,  1.15145243, -0.91206272],
       [ 0.52533066,  0.91317564, -0.64129757, -0.45109685,  0.85993942,
        -0.36214298, -0.86846836,  1.09641582],
       [ 0.52533066, -0.97207061, -0.47326137,  2.21681883, -1.16287262,
        -0.36214298, -0.86846836,  1.09641582]])
Classification

Now, it is your turn, use the training set to build an accurate model. Then use the test set to report the accuracy of the model You should use the following algorithm:

  • K Nearest Neighbor(KNN)
  • Decision Tree
  • Support Vector Machine
  • Logistic Regression

Notice:

  • You can go above and change the pre-processing, feature selection, feature-extraction, and so on, to make a better model.
  • You should use either scikit-learn, Scipy or Numpy libraries for developing the classification algorithms.
  • You should include the code of the algorithm in the following cells.
K Nearest Neighbor(KNN)

Notice: You should find the best k to build the model with the best accuracy.
warning: You should not use the loan_test.csv for finding the best k, however, you can split your train_loan.csv into train and test to find the best k.

In [24]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [25]:

# We will be checking for value of K from 0 to 25 so
n = 25
accuracy = np.zeros(n)
for i in range(1,n+1):
    clf = KNeighborsClassifier(n_neighbors = i).fit(X_train, y_train)
    y_test_predicted = clf.predict(X_test)
    accuracy[i-1] = (accuracy_score(y_test, y_test_predicted))
accuracy

Out[25]:

array([ 0.65467626,  0.72661871,  0.67625899,  0.72661871,  0.69064748,
        0.69064748,  0.64748201,  0.69064748,  0.63309353,  0.69784173,
        0.69064748,  0.69784173,  0.69064748,  0.69784173,  0.6618705 ,
        0.69064748,  0.6618705 ,  0.67625899,  0.67625899,  0.67625899,
        0.70503597,  0.71223022,  0.72661871,  0.74100719,  0.74100719])

In [26]:

plt.plot(range(1,n+1),accuracy,'g')
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.show()

accuracy = pd.DataFrame(accuracy)
print("Maximum Accuracy Got is - " )
accuracy.sort_values(by = 0, ascending = False)[0:1]
Maximum Accuracy Got is - 

Out[26]:

0
240.741007

In [27]:

clf_KNN = KNeighborsClassifier(n_neighbors = 24).fit(X_train, y_train)
# y_test_pred_KNN = clf_KNN.predict(X_test)
Decision Tree

In [28]:

from sklearn.tree import DecisionTreeClassifier

In [29]:

clf2 = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
y_test_pred_KNN = clf.predict(X_test)
print("Accuracy using criterion as gini - ", accuracy_score(y_test, y_test_pred_KNN))
clf3 = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train)
y_test_pred_KNN = clf2.predict(X_test)
print("Accuracy using criterion as entropy - ", accuracy_score(y_test, y_test_pred_KNN))
Accuracy using criterion as gini -  0.741007194245
Accuracy using criterion as entropy -  0.654676258993

In [30]:

# using criterion as gini
clf_DT = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
Support Vector Machine

In [31]:

from sklearn.svm import SVC

In [32]:

clf4 = SVC(kernel = 'poly').fit(X_train, y_train)
print("accuracy using polynomial kernel - ", accuracy_score(y_test, clf2.predict(X_test)))
clf5 = SVC(kernel = 'rbf').fit(X_train, y_train)
print("accuracy using Radial Basis function kernel - ", accuracy_score(y_test, clf3.predict(X_test)))
accuracy using polynomial kernel -  0.654676258993
accuracy using Radial Basis function kernel -  0.640287769784

In [33]:

# using linear for kernel in our SVM model
clf_SVM = SVC(kernel = 'poly', random_state = 4).fit(X_train, y_train)
Logistic Regression

In [34]:

from sklearn.linear_model import LogisticRegression

In [35]:

clf_LR = LogisticRegression(solver='lbfgs', warm_start = True)
clf_LR.fit(X_train, y_train)
#print("accuracy score - ", accuracy_score(y_test, clf_LR.predict(X_test)))

Out[35]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=True)
Model Evaluation using Test set

In [36]:

from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

First, download and load the test set:

In [37]:

!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
--2019-05-26 01:25:52--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3642 (3.6K) [text/csv]
Saving to: ‘loan_test.csv’

100%[======================================>] 3,642       --.-K/s   in 0s      

2019-05-26 01:25:52 (585 MB/s) - ‘loan_test.csv’ saved [3642/3642]

Load Test set for evaluation

In [38]:

test_df = pd.read_csv('loan_test.csv')
test_df.head()

Out[38]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGender
011PAIDOFF1000309/8/201610/7/201650Bechalorfemale
155PAIDOFF30079/9/20169/15/201635Master or Abovemale
22121PAIDOFF1000309/10/201610/9/201643High School or Belowfemale
32424PAIDOFF1000309/10/201610/9/201626collegemale
43535PAIDOFF800159/11/20169/25/201629Bechalormale
Test Data Pre-processing: Feature selection/extraction same as done with train data

In [39]:

# conversion to datetime object
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df.head()

Out[39]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGender
011PAIDOFF1000302016-09-082016-10-0750Bechalorfemale
155PAIDOFF30072016-09-092016-09-1535Master or Abovemale
22121PAIDOFF1000302016-09-102016-10-0943High School or Belowfemale
32424PAIDOFF1000302016-09-102016-10-0926collegemale
43535PAIDOFF800152016-09-112016-09-2529Bechalormale

In [40]:

# creating weekend column
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3)  else 0)
test_df.head()

Out[40]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGenderdayofweekweekend
011PAIDOFF1000302016-09-082016-10-0750Bechalorfemale30
155PAIDOFF30072016-09-092016-09-1535Master or Abovemale41
22121PAIDOFF1000302016-09-102016-10-0943High School or Belowfemale51
32424PAIDOFF1000302016-09-102016-10-0926collegemale51
43535PAIDOFF800152016-09-112016-09-2529Bechalormale61

In [41]:

# replacing male and female string with 0 and 1
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
test_df.head()

Out[41]:

Unnamed: 0Unnamed: 0.1loan_statusPrincipaltermseffective_datedue_dateageeducationGenderdayofweekweekend
011PAIDOFF1000302016-09-082016-10-0750Bechalor130
155PAIDOFF30072016-09-092016-09-1535Master or Above041
22121PAIDOFF1000302016-09-102016-10-0943High School or Below151
32424PAIDOFF1000302016-09-102016-10-0926college051
43535PAIDOFF800152016-09-112016-09-2529Bechalor061

In [42]:

# creaing dummies and dropping one column

X_t = test_df[['Principal','terms','age','Gender','weekend']]
X_t = pd.concat([X_t,pd.get_dummies(test_df['education'])], axis=1)
X_t.drop(['Master or Above'], axis = 1,inplace=True)
X_t.head()

Out[42]:

PrincipaltermsageGenderweekendBechalorHigh School or Belowcollege
01000305010100
130073501000
21000304311010
31000302601001
4800152901100

In [43]:

# Feature Scaling
X_t = preprocessing.StandardScaler().fit(X_t).transform(X_t)
X_t[0:5]

Out[43]:

array([[ 0.49362588,  0.92844966,  3.05981865,  1.97714211, -1.30384048,
         2.39791576, -0.79772404, -0.86135677],
       [-3.56269116, -1.70427745,  0.53336288, -0.50578054,  0.76696499,
        -0.41702883, -0.79772404, -0.86135677],
       [ 0.49362588,  0.92844966,  1.88080596,  1.97714211,  0.76696499,
        -0.41702883,  1.25356634, -0.86135677],
       [ 0.49362588,  0.92844966, -0.98251057, -0.50578054,  0.76696499,
        -0.41702883, -0.79772404,  1.16095912],
       [-0.66532184, -0.78854628, -0.47721942, -0.50578054,  0.76696499,
         2.39791576, -0.79772404, -0.86135677]])

In [44]:

test_df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1],inplace=True)

In [45]:

y_t = test_df['loan_status'].values
y_t[0:20]

Out[45]:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [46]:

# np arrays to store intermediate result
Jaccard = np.full(4, np.nan)
F1_score = np.full(4, np.nan)
LogLoss = np.full(4, np.nan)
Algorithm = np.array(4)
Algorithm = ["KNN", "Decision Tree", "SVM", "LogisticRegression"]

In [47]:

Jaccard[0] = jaccard_similarity_score(y_t, clf_KNN.predict(X_t))
Jaccard[1] = jaccard_similarity_score(y_t, clf_DT.predict(X_t))
Jaccard[2] = jaccard_similarity_score(y_t, clf_SVM.predict(X_t))
Jaccard[3] = jaccard_similarity_score(y_t, clf_LR.predict(X_t))

In [48]:

F1_score[0] = f1_score(y_t, clf_KNN.predict(X_t))
F1_score[1] = f1_score(y_t, clf_DT.predict(X_t))
F1_score[2] = f1_score(y_t, clf_SVM.predict(X_t))
F1_score[3] = f1_score(y_t, clf_LR.predict(X_t))

In [49]:

LogLoss[3] = log_loss(y_t, clf_LR.predict_proba(X_t))

In [50]:

Report = pd.DataFrame({"Jaccard":Jaccard, "F1-score":F1_score, "LogLoss":LogLoss}, index=Algorithm)
Report

In [51]:

Report

Out[51]:

F1-scoreJaccardLogLoss
KNN0.2352940.759259NaN
Decision Tree0.3000000.740741NaN
SVM0.3529410.796296NaN
LogisticRegression0.2500000.7777780.470967
Conclusion:

I hope this Machine Learning with Python Peer-graded Assignment Solutions would be useful for you to learn something new from this Course. If it helped you then don’t forget to bookmark our site for more peer graded solutions.

This course is intended for audiences of all experiences who are interested in learning about new skills in a business context; there are no prerequisite courses.

Keep Learning!

More Peer-graded Assignment Solutions >>

Peer-graded Assignment: Your Two-Minute Pitch Solution

Peer-graded Assignment: The Role of Work Solution

Peer-graded Assignment: Your Strengths as a Team Player Solution

Leave a Reply

Your email address will not be published. Required fields are marked *