Physical Address
304 North Cardinal St.
Dorchester Center, MA 02124
This course dives into the basics of machine learning using an approachable, and well-known programming language, Python.
In this course, we will be reviewing two main components: First, you will be learning about the purpose of Machine Learning and where it applies to the real world. Second, you will get a general overview of Machine Learning topics such as supervised vs unsupervised learning, model evaluation, and Machine Learning algorithms.
Now that you have been equipped with the skills to use different Machine Learning algorithms, over the course of five weeks, you will have the opportunity to practice and apply it on a dataset. In this project, you will complete a notebook where you will build a classifier to predict whether a loan case will be paid off or not.
You load a historical dataset from previous loan applications, clean the data, and apply different classification algorithm on the data. You are expected to use the following algorithms to build your models:
The results is reported as the accuracy of each classifier, using the following metrics when these are applicable:
In this notebook we try to practice all the classification algorithms that we learned in this course.
We load a dataset using Pandas library, and apply the following algorithms, and find the best one for this specific dataset by accuracy evaluation methods.
Lets first load required libraries:
In [1]:
import itertools import numpy as np import matplotlib.pyplot as plt from matplotlib.ticker import NullFormatter import pandas as pd import numpy as np import matplotlib.ticker as ticker from sklearn import preprocessing %matplotlib inline
This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loan are already paid off or defaulted. It includes following fields:
Field | Description |
---|---|
Loan_status | Whether a loan is paid off on in collection |
Principal | Basic principal loan amount at the |
Terms | Origination terms which can be weekly (7 days), biweekly, and monthly payoff schedule |
Effective_date | When the loan got originated and took effects |
Due_date | Since it’s one-time payoff schedule, each loan has one single due date |
Age | Age of applicant |
Education | Education of applicant |
Gender | The gender of applicant |
Lets download the dataset
In [2]:
!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv
--2019-05-26 01:24:50-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193 Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 23101 (23K) [text/csv] Saving to: ‘loan_train.csv’ 100%[======================================>] 23,101 --.-K/s in 0.002s 2019-05-26 01:24:50 (12.7 MB/s) - ‘loan_train.csv’ saved [23101/23101]
In [3]:
df = pd.read_csv('loan_train.csv') df.head(10)
Out[3]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | High School or Below | male |
1 | 2 | 2 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | Bechalor | female |
2 | 3 | 3 | PAIDOFF | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
3 | 4 | 4 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
4 | 6 | 6 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |
5 | 7 | 7 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 36 | college | male |
6 | 8 | 8 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | male |
7 | 9 | 9 | PAIDOFF | 800 | 15 | 9/10/2016 | 9/24/2016 | 26 | college | male |
8 | 10 | 10 | PAIDOFF | 300 | 7 | 9/10/2016 | 9/16/2016 | 29 | college | male |
9 | 11 | 11 | PAIDOFF | 1000 | 15 | 9/10/2016 | 10/9/2016 | 39 | High School or Below | male |
In [4]:
df.shape
Out[4]:
(346, 10)
In [5]:
df['due_date'] = pd.to_datetime(df['due_date']) df['effective_date'] = pd.to_datetime(df['effective_date']) df.head()
Out[5]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | male |
1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | female |
2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male |
3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female |
4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male |
Let’s see how many of each class is in our data set
In [6]:
df['loan_status'].value_counts()
Out[6]:
PAIDOFF 260 COLLECTION 86 Name: loan_status, dtype: int64
260 people have paid off the loan on time while 86 have gone into collection
Lets plot some columns to underestand data better:
In [7]:
# notice: installing seaborn might takes a few minutes # !conda install -c anaconda seaborn -y
In [8]:
import seaborn as sns bins = np.linspace(df.Principal.min(), df.Principal.max(), 10) g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2) g.map(plt.hist, 'Principal', bins=bins, ec="k") g.axes[-1].legend() plt.show()
In [9]:
bins = np.linspace(df.age.min(), df.age.max(), 10) g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2) g.map(plt.hist, 'age', bins=bins, ec="k") g.axes[-1].legend() plt.show()
In [10]:
df['dayofweek'] = df['effective_date'].dt.dayofweek bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10) g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2) g.map(plt.hist, 'dayofweek', bins=bins, ec="k") g.axes[-1].legend() plt.show()
We see that people who get the loan at the end of the week dont pay it off, so lets use Feature binarization to set a threshold values less then day 4
In [11]:
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0) df.head()
Out[11]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | male | 3 | 0 |
1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | female | 3 | 0 |
2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male | 3 | 0 |
3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female | 4 | 1 |
4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male | 4 | 1 |
Lets look at gender:
In [12]:
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
Out[12]:
Gender loan_status female PAIDOFF 0.865385 COLLECTION 0.134615 male PAIDOFF 0.731293 COLLECTION 0.268707 Name: loan_status, dtype: float64
86 % of female pay there loans while only 73 % of males pay there loan
Lets convert male to 0 and female to 1:
In [13]:
df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True) df.head()
Out[13]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | 0 | 3 | 0 |
1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | 1 | 3 | 0 |
2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | 0 | 3 | 0 |
3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | 1 | 4 | 1 |
4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | 0 | 4 | 1 |
In [14]:
df.groupby(['education'])['loan_status'].value_counts(normalize=True)
Out[14]:
education loan_status Bechalor PAIDOFF 0.750000 COLLECTION 0.250000 High School or Below PAIDOFF 0.741722 COLLECTION 0.258278 Master or Above COLLECTION 0.500000 PAIDOFF 0.500000 college PAIDOFF 0.765101 COLLECTION 0.234899 Name: loan_status, dtype: float64
In [15]:
df[['Principal','terms','age','Gender','education']].head()
Out[15]:
Principal | terms | age | Gender | education | |
---|---|---|---|---|---|
0 | 1000 | 30 | 45 | 0 | High School or Below |
1 | 1000 | 30 | 33 | 1 | Bechalor |
2 | 1000 | 15 | 27 | 0 | college |
3 | 1000 | 30 | 28 | 1 | college |
4 | 1000 | 30 | 29 | 0 | college |
In [16]:
Feature = df[['Principal','terms','age','Gender','weekend']] Feature = pd.concat([Feature,pd.get_dummies(df['education'])], axis=1) Feature.drop(['Master or Above'], axis = 1,inplace=True) Feature.head()
Out[16]:
Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college | |
---|---|---|---|---|---|---|---|---|
0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 |
1 | 1000 | 30 | 33 | 1 | 0 | 1 | 0 | 0 |
2 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 |
3 | 1000 | 30 | 28 | 1 | 1 | 0 | 0 | 1 |
4 | 1000 | 30 | 29 | 0 | 1 | 0 | 0 | 1 |
Lets defind feature sets, X:
In [17]:
X = Feature X[0:5]
Out[17]:
Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college | |
---|---|---|---|---|---|---|---|---|
0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 |
1 | 1000 | 30 | 33 | 1 | 0 | 1 | 0 | 0 |
2 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 |
3 | 1000 | 30 | 28 | 1 | 1 | 0 | 0 | 1 |
4 | 1000 | 30 | 29 | 0 | 1 | 0 | 0 | 1 |
What are our lables?
In [18]:
df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1],inplace=True) df.head()
Out[18]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | 0 | 3 | 0 |
1 | 2 | 2 | 0 | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | 1 | 3 | 0 |
2 | 3 | 3 | 0 | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | 0 | 3 | 0 |
3 | 4 | 4 | 0 | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | 1 | 4 | 1 |
4 | 6 | 6 | 0 | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | 0 | 4 | 1 |
In [19]:
y = df['loan_status'].values y[0:20]
Out[19]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [20]:
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.40, random_state=5)
In [21]:
print ('Train set:', X_train.shape, y_train.shape) print ('Test set:', X_test.shape, y_test.shape)
Train set: (207, 8) (207,) Test set: (139, 8) (139,)
In [22]:
display(X_train.shape, y_train.shape) display(X_test.shape, y_test.shape)
(207, 8)
(207,)
(139, 8)
(139,)
Data Standardization give data zero mean and unit variance (technically should be done after train test split )
In [23]:
# (technically should be done after train test split ) X_train = preprocessing.StandardScaler().fit(X_train).transform(X_train) X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test) X_train[0:5]
Out[23]:
array([[ 0.52533066, -0.97207061, 0.53495582, -0.45109685, 0.85993942, 2.76134025, -0.86846836, -0.91206272], [ 0.52533066, 0.91317564, -0.47326137, 2.21681883, 0.85993942, -0.36214298, -0.86846836, 1.09641582], [ 0.52533066, 0.91317564, -0.13718897, -0.45109685, 0.85993942, -0.36214298, 1.15145243, -0.91206272], [ 0.52533066, 0.91317564, -0.64129757, -0.45109685, 0.85993942, -0.36214298, -0.86846836, 1.09641582], [ 0.52533066, -0.97207061, -0.47326137, 2.21681883, -1.16287262, -0.36214298, -0.86846836, 1.09641582]])
Now, it is your turn, use the training set to build an accurate model. Then use the test set to report the accuracy of the model You should use the following algorithm:
Notice:
Notice: You should find the best k to build the model with the best accuracy.
warning: You should not use the loan_test.csv for finding the best k, however, you can split your train_loan.csv into train and test to find the best k.
In [24]:
from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import accuracy_score
In [25]:
# We will be checking for value of K from 0 to 25 so n = 25 accuracy = np.zeros(n) for i in range(1,n+1): clf = KNeighborsClassifier(n_neighbors = i).fit(X_train, y_train) y_test_predicted = clf.predict(X_test) accuracy[i-1] = (accuracy_score(y_test, y_test_predicted)) accuracy
Out[25]:
array([ 0.65467626, 0.72661871, 0.67625899, 0.72661871, 0.69064748, 0.69064748, 0.64748201, 0.69064748, 0.63309353, 0.69784173, 0.69064748, 0.69784173, 0.69064748, 0.69784173, 0.6618705 , 0.69064748, 0.6618705 , 0.67625899, 0.67625899, 0.67625899, 0.70503597, 0.71223022, 0.72661871, 0.74100719, 0.74100719])
In [26]:
plt.plot(range(1,n+1),accuracy,'g') plt.ylabel('Accuracy') plt.xlabel('Number of Neighbors (K)') plt.show() accuracy = pd.DataFrame(accuracy) print("Maximum Accuracy Got is - " ) accuracy.sort_values(by = 0, ascending = False)[0:1]
Maximum Accuracy Got is -
Out[26]:
0 | |
---|---|
24 | 0.741007 |
In [27]:
clf_KNN = KNeighborsClassifier(n_neighbors = 24).fit(X_train, y_train) # y_test_pred_KNN = clf_KNN.predict(X_test)
In [28]:
from sklearn.tree import DecisionTreeClassifier
In [29]:
clf2 = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train) y_test_pred_KNN = clf.predict(X_test) print("Accuracy using criterion as gini - ", accuracy_score(y_test, y_test_pred_KNN)) clf3 = DecisionTreeClassifier(criterion = 'entropy').fit(X_train, y_train) y_test_pred_KNN = clf2.predict(X_test) print("Accuracy using criterion as entropy - ", accuracy_score(y_test, y_test_pred_KNN))
Accuracy using criterion as gini - 0.741007194245 Accuracy using criterion as entropy - 0.654676258993
In [30]:
# using criterion as gini clf_DT = DecisionTreeClassifier(criterion = 'gini').fit(X_train, y_train)
In [31]:
from sklearn.svm import SVC
In [32]:
clf4 = SVC(kernel = 'poly').fit(X_train, y_train) print("accuracy using polynomial kernel - ", accuracy_score(y_test, clf2.predict(X_test))) clf5 = SVC(kernel = 'rbf').fit(X_train, y_train) print("accuracy using Radial Basis function kernel - ", accuracy_score(y_test, clf3.predict(X_test)))
accuracy using polynomial kernel - 0.654676258993 accuracy using Radial Basis function kernel - 0.640287769784
In [33]:
# using linear for kernel in our SVM model clf_SVM = SVC(kernel = 'poly', random_state = 4).fit(X_train, y_train)
In [34]:
from sklearn.linear_model import LogisticRegression
In [35]:
clf_LR = LogisticRegression(solver='lbfgs', warm_start = True) clf_LR.fit(X_train, y_train) #print("accuracy score - ", accuracy_score(y_test, clf_LR.predict(X_test)))
Out[35]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=True)
In [36]:
from sklearn.metrics import jaccard_similarity_score from sklearn.metrics import f1_score from sklearn.metrics import log_loss
First, download and load the test set:
In [37]:
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
--2019-05-26 01:25:52-- https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.193 Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.193|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 3642 (3.6K) [text/csv] Saving to: ‘loan_test.csv’ 100%[======================================>] 3,642 --.-K/s in 0s 2019-05-26 01:25:52 (585 MB/s) - ‘loan_test.csv’ saved [3642/3642]
In [38]:
test_df = pd.read_csv('loan_test.csv') test_df.head()
Out[38]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 50 | Bechalor | female |
1 | 5 | 5 | PAIDOFF | 300 | 7 | 9/9/2016 | 9/15/2016 | 35 | Master or Above | male |
2 | 21 | 21 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 43 | High School or Below | female |
3 | 24 | 24 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 26 | college | male |
4 | 35 | 35 | PAIDOFF | 800 | 15 | 9/11/2016 | 9/25/2016 | 29 | Bechalor | male |
In [39]:
# conversion to datetime object test_df['due_date'] = pd.to_datetime(test_df['due_date']) test_df['effective_date'] = pd.to_datetime(test_df['effective_date']) test_df.head()
Out[39]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 50 | Bechalor | female |
1 | 5 | 5 | PAIDOFF | 300 | 7 | 2016-09-09 | 2016-09-15 | 35 | Master or Above | male |
2 | 21 | 21 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 43 | High School or Below | female |
3 | 24 | 24 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 26 | college | male |
4 | 35 | 35 | PAIDOFF | 800 | 15 | 2016-09-11 | 2016-09-25 | 29 | Bechalor | male |
In [40]:
# creating weekend column test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0) test_df.head()
Out[40]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 50 | Bechalor | female | 3 | 0 |
1 | 5 | 5 | PAIDOFF | 300 | 7 | 2016-09-09 | 2016-09-15 | 35 | Master or Above | male | 4 | 1 |
2 | 21 | 21 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 43 | High School or Below | female | 5 | 1 |
3 | 24 | 24 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 26 | college | male | 5 | 1 |
4 | 35 | 35 | PAIDOFF | 800 | 15 | 2016-09-11 | 2016-09-25 | 29 | Bechalor | male | 6 | 1 |
In [41]:
# replacing male and female string with 0 and 1 test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True) test_df.head()
Out[41]:
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 50 | Bechalor | 1 | 3 | 0 |
1 | 5 | 5 | PAIDOFF | 300 | 7 | 2016-09-09 | 2016-09-15 | 35 | Master or Above | 0 | 4 | 1 |
2 | 21 | 21 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 43 | High School or Below | 1 | 5 | 1 |
3 | 24 | 24 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 26 | college | 0 | 5 | 1 |
4 | 35 | 35 | PAIDOFF | 800 | 15 | 2016-09-11 | 2016-09-25 | 29 | Bechalor | 0 | 6 | 1 |
In [42]:
# creaing dummies and dropping one column X_t = test_df[['Principal','terms','age','Gender','weekend']] X_t = pd.concat([X_t,pd.get_dummies(test_df['education'])], axis=1) X_t.drop(['Master or Above'], axis = 1,inplace=True) X_t.head()
Out[42]:
Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college | |
---|---|---|---|---|---|---|---|---|
0 | 1000 | 30 | 50 | 1 | 0 | 1 | 0 | 0 |
1 | 300 | 7 | 35 | 0 | 1 | 0 | 0 | 0 |
2 | 1000 | 30 | 43 | 1 | 1 | 0 | 1 | 0 |
3 | 1000 | 30 | 26 | 0 | 1 | 0 | 0 | 1 |
4 | 800 | 15 | 29 | 0 | 1 | 1 | 0 | 0 |
In [43]:
# Feature Scaling X_t = preprocessing.StandardScaler().fit(X_t).transform(X_t) X_t[0:5]
Out[43]:
array([[ 0.49362588, 0.92844966, 3.05981865, 1.97714211, -1.30384048, 2.39791576, -0.79772404, -0.86135677], [-3.56269116, -1.70427745, 0.53336288, -0.50578054, 0.76696499, -0.41702883, -0.79772404, -0.86135677], [ 0.49362588, 0.92844966, 1.88080596, 1.97714211, 0.76696499, -0.41702883, 1.25356634, -0.86135677], [ 0.49362588, 0.92844966, -0.98251057, -0.50578054, 0.76696499, -0.41702883, -0.79772404, 1.16095912], [-0.66532184, -0.78854628, -0.47721942, -0.50578054, 0.76696499, 2.39791576, -0.79772404, -0.86135677]])
In [44]:
test_df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[0,1],inplace=True)
In [45]:
y_t = test_df['loan_status'].values y_t[0:20]
Out[45]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [46]:
# np arrays to store intermediate result Jaccard = np.full(4, np.nan) F1_score = np.full(4, np.nan) LogLoss = np.full(4, np.nan) Algorithm = np.array(4) Algorithm = ["KNN", "Decision Tree", "SVM", "LogisticRegression"]
In [47]:
Jaccard[0] = jaccard_similarity_score(y_t, clf_KNN.predict(X_t)) Jaccard[1] = jaccard_similarity_score(y_t, clf_DT.predict(X_t)) Jaccard[2] = jaccard_similarity_score(y_t, clf_SVM.predict(X_t)) Jaccard[3] = jaccard_similarity_score(y_t, clf_LR.predict(X_t))
In [48]:
F1_score[0] = f1_score(y_t, clf_KNN.predict(X_t)) F1_score[1] = f1_score(y_t, clf_DT.predict(X_t)) F1_score[2] = f1_score(y_t, clf_SVM.predict(X_t)) F1_score[3] = f1_score(y_t, clf_LR.predict(X_t))
In [49]:
LogLoss[3] = log_loss(y_t, clf_LR.predict_proba(X_t))
In [50]:
Report = pd.DataFrame({"Jaccard":Jaccard, "F1-score":F1_score, "LogLoss":LogLoss}, index=Algorithm)
In [51]:
Report
Out[51]:
F1-score | Jaccard | LogLoss | |
---|---|---|---|
KNN | 0.235294 | 0.759259 | NaN |
Decision Tree | 0.300000 | 0.740741 | NaN |
SVM | 0.352941 | 0.796296 | NaN |
LogisticRegression | 0.250000 | 0.777778 | 0.470967 |
I hope this Machine Learning with Python Peer-graded Assignment Solutions would be useful for you to learn something new from this Course. If it helped you then don’t forget to bookmark our site for more peer graded solutions.
This course is intended for audiences of all experiences who are interested in learning about new skills in a business context; there are no prerequisite courses.
Keep Learning!
More Peer-graded Assignment Solutions >>
Peer-graded Assignment: Your Two-Minute Pitch Solution
Peer-graded Assignment: The Role of Work Solution
Peer-graded Assignment: Your Strengths as a Team Player Solution