
# Data Analysis with Python Peer Graded assignment Solution – Why Quiz

## Project Scenario: Data Analysis with Python Peer Graded assignment Solution

In this assignment, you are a Data Analyst working at a Real Estate Investment Trust. The Trust would like to start investing in residential real estate. You are tasked with determining the market price of a house given a set of features. You will analyze and predict housing prices using attributes or features such as square footage, number of bedrooms, number of floors, and so on. A template notebook is provided in the lab; your job is to complete the ten questions. Some hints for the questions are given in the template notebook.

#### Dataset Used in this Assignment

The dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. It was taken from here. It was also slightly modified for the purposes of this course.

For this project, you will utilize JupyterLab running on the Cloud in Skills Network Labs environment.

#### Instructions: Data Analysis with Python Peer Graded assignment Solution

Here you are!

I hope you enjoyed playing Data Scientist at a Real Estate Investment Trust. Well done!

This rubric will provide you with a grade breakdown for the evaluation of the final project of your peers.

### Data Analysis with Python Peer Graded assignment Solution

#### House Sales in King County, USA

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

id: A notation for a house

date: Date house was sold

price: Price is prediction target

bedrooms: Number of bedrooms

bathrooms: Number of bathrooms

sqft_living: Square footage of the home

sqft_lot: Square footage of the lot

floors: Total floors (levels) in house

waterfront: House which has a view to a waterfront

view: Has been viewed

condition: How good the condition is overall

sqft_above: Square footage of house apart from basement

sqft_basement: Square footage of the basement

yr_built: Built year

yr_renovated: Year when house was renovated

zipcode: Zip code

lat: Latitude coordinate

long: Longitude coordinate

sqft_living15: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area.

sqft_lot15: Lot size area in 2015 (implies some renovations)

You will require the following libraries:

In [1]:

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline
```

#### Module 1: Importing Data Sets

In [2]:

```
file_name = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
# Load the CSV into the dataframe used by every later cell
df = pd.read_csv(file_name)
```

We use the method `head` to display the first 5 rows of the dataframe.

In [3]:

`df.head()`

Out[3]:

5 rows × 22 columns
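To make the behaviour of `head` concrete, here is a minimal sketch on a made-up sample. The CSV text and its three columns are hypothetical stand-ins for the real file, so the example runs without a network connection.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for the real CSV file.
csv_text = """id,price,bedrooms
1,221900.0,3
2,538000.0,3
3,180000.0,2
4,604000.0,4
5,510000.0,3
6,1225000.0,4
"""

sample = pd.read_csv(io.StringIO(csv_text))

first_five = sample.head()   # default n=5: the first five rows, all columns
first_two = sample.head(2)   # an explicit row count

print(first_five.shape)      # (number of rows, number of columns)
print(first_two)
```

Note that `head` limits rows, not columns: the sample has six rows and three columns, and `head()` keeps all three columns.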

### Question 1

Display the data types of each column using the attribute `dtypes`, then take a screenshot and submit it. Include your code in the image.

In [4]:

`df.dtypes`

Out[4]:

```Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object```

We use the method describe to obtain a statistical summary of the dataframe.

In [5]:

`df.describe()`

Out[5]:

8 rows × 21 columns

#### Question 2

Drop the columns `"id"` and `"Unnamed: 0"` from axis 1 using the method `drop()`, then use the method `describe()` to obtain a statistical summary of the data. Take a screenshot and submit it. Make sure the `inplace` parameter is set to `True`.

In [6]:

```df.drop("id", axis=1,inplace=True)
df.drop("Unnamed: 0", axis=1, inplace=True)

df.describe()```

Out[6]:
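As a side note, `drop()` also accepts a list of labels, so both columns can be removed in one call. The sketch below uses a small made-up dataframe (the values are hypothetical) and returns a new dataframe instead of modifying in place, which is a stylistic alternative and not what the assignment asks for.

```python
import pandas as pd

# Hypothetical toy dataframe with the same two helper columns.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "id": [10, 11],
    "price": [100.0, 200.0],
})

# Drop both columns at once; `toy` itself is left untouched.
trimmed = toy.drop(columns=["id", "Unnamed: 0"])

print(trimmed.columns.tolist())
```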

We can see we have missing values for the columns `bedrooms` and `bathrooms`.

In [7]:

```print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())```
```number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10```

We can replace the missing values of the column `'bedrooms'` with the mean of the column `'bedrooms'` using the method `replace()`. Don’t forget to set the `inplace` parameter to `True`.

In [8]:

```mean=df['bedrooms'].mean()
df['bedrooms'].replace(np.nan,mean, inplace=True)```

We also replace the missing values of the column `'bathrooms'` with the mean of the column `'bathrooms'` using the method `replace()`. Don’t forget to set the `inplace` parameter to `True`.

In [9]:

```mean=df['bathrooms'].mean()
df['bathrooms'].replace(np.nan,mean, inplace=True)```

In [10]:

```print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())```
```number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0```
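In recent pandas versions, calling an in-place method on a column selected with `df['col']` may raise a chained-assignment warning or leave the dataframe unchanged. If the NaN counts above do not drop to zero in your environment, assigning the result back to the column is a common alternative. The sketch below assumes a small made-up dataframe and uses `fillna`, which is not the method the assignment names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "bedrooms": [3.0, np.nan, 4.0, 2.0],
    "bathrooms": [1.0, 2.0, np.nan, 3.0],
})

for col in ["bedrooms", "bathrooms"]:
    # mean() skips NaN by default; assign the filled column back explicitly.
    toy[col] = toy[col].fillna(toy[col].mean())

print(toy.isnull().sum())
```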

#### Question 3

Use the method `value_counts` to count the number of houses with unique floor values, then use the method `.to_frame()` to convert the result to a dataframe.

In [11]:

`df['floors'].value_counts().to_frame()`

Out[11]:
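Since the output table is not reproduced here, the following sketch shows what the call returns on a made-up `floors` column (the values are hypothetical). The name of the resulting count column differs between pandas versions, so the example reads it by position.

```python
import pandas as pd

# Hypothetical floor values for six houses.
toy = pd.DataFrame({"floors": [1.0, 2.0, 1.0, 1.5, 2.0, 1.0]})

# value_counts() returns a Series sorted by count, most frequent first;
# to_frame() turns it into a one-column dataframe indexed by floor value.
counts = toy["floors"].value_counts().to_frame()

print(counts)
```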

#### Question 4

Use the function `boxplot` in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers.

In [13]:

`sns.boxplot(x="waterfront", y="price", data=df)`

Out[13]:

`<matplotlib.axes._subplots.AxesSubplot at 0x7f456e280fd0>`
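A boxplot draws points beyond 1.5 times the interquartile range from the box as outliers, so the visual comparison can be backed up with a count. The sketch below applies that rule per group on made-up prices; the data and the helper `count_outliers` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical prices: eight non-waterfront houses, six waterfront houses.
toy = pd.DataFrame({
    "waterfront": [0] * 8 + [1] * 6,
    "price": [200, 210, 220, 230, 240, 250, 260, 900,
              1000, 1100, 1200, 1300, 1400, 1500],
})


def count_outliers(prices):
    """Count values outside the 1.5 * IQR whiskers of a boxplot."""
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    mask = (prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)
    return int(mask.sum())


# One outlier count per waterfront group.
outliers = toy.groupby("waterfront")["price"].apply(count_outliers)
print(outliers)
```

On the real dataframe, the same `groupby` call on `df` would give the number of price outliers for each waterfront value.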

#### Question 5

Use the function `regplot` in the seaborn library to determine if the feature `sqft_above` is negatively or positively correlated with price.

In [15]:

`sns.regplot(x="sqft_above", y="price", data=df, ci=None)`

Out[15]:

`<matplotlib.axes._subplots.AxesSubplot at 0x7f456d1cd910>`

We can use the Pandas method `corr()` to find the feature other than price that is most correlated with price.

In [16]:

`df.corr(numeric_only=True)['price'].sort_values()`

Out[16]:

```zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64```
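The sorted output can also be reduced to a single answer programmatically. The sketch below, on made-up data, drops `price` itself and picks the feature with the largest absolute correlation. It assumes pandas 1.5 or later, where `corr` accepts `numeric_only`, which keeps the text column `date` out of the calculation.

```python
import pandas as pd

# Hypothetical toy data; `date` is a non-numeric column like in the real set.
toy = pd.DataFrame({
    "price": [100, 200, 300, 400, 500],
    "sqft_living": [10, 19, 31, 42, 50],
    "zipcode": [5, 3, 4, 1, 2],
    "date": ["a", "b", "c", "d", "e"],
})

# Correlation of every numeric feature with price, excluding price itself.
corr_with_price = toy.corr(numeric_only=True)["price"].drop("price")

# Use the absolute value so strong negative correlations also count.
best = corr_with_price.abs().idxmax()
print(best, corr_with_price[best])
```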

#### Module 4: Model Development

We can fit a linear regression model using the longitude feature `'long'` and calculate the R^2.

In [17]:

```X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)```

Out[17]:

`0.00046769430149007363`
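The value returned by `score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean of the target. A value this close to zero means longitude alone explains almost none of the variance in price. The sketch below checks the formula against `score` on synthetic data; the data and its parameters are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy linear relation between area and price.
rng = np.random.default_rng(0)
x = rng.uniform(500, 4000, size=200)
y = 280 * x + rng.normal(0, 150000, size=200)

X = x.reshape(-1, 1)               # scikit-learn expects a 2-D feature array
lm = LinearRegression().fit(X, y)
y_hat = lm.predict(X)

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = lm.score(X, y)

print(r2_manual, r2_sklearn)
```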

#### Question 6

Fit a linear regression model to predict the `'price'` using the feature `'sqft_living'` then calculate the R^2. Take a screenshot of your code and the value of the R^2.

In [18]:

```
X1 = df[['sqft_living']]
Y1 = df['price']
lm = LinearRegression()
lm.fit(X1, Y1)
lm.score(X1, Y1)
```

Out[18]:

`0.4928532179037931`

#### Question 7

Fit a linear regression model to predict the `'price'` using the list of features:

In [19]:

`features = ["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]`

Then calculate the R^2. Take a screenshot of your code.

In [20]:

```X2 = df[features]
Y2 = df['price']
lm.fit(X2,Y2)
lm.score(X2,Y2)```

Out[20]:

`0.657679183672129`

#### This will help with Question 8

Create a list of tuples. The first element of each tuple contains the name of the estimator:

`'scale'`

`'polynomial'`

`'model'`

The second element of each tuple contains the model constructor:

`StandardScaler()`

`PolynomialFeatures(include_bias=False)`

`LinearRegression()`

In [21]:

`Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]`

### Question 8

Use the list to create a pipeline object to predict the `'price'`, fit the object using the features in the list `features`, and calculate the R^2.

In [22]:

```pipe=Pipeline(Input)
pipe.fit(df[features],df['price'])
pipe.score(df[features],df['price'])```

Out[22]:

`0.7513408553309376`
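A pipeline simply chains the steps: each transformer's output feeds the next step, and the last step is the model. The sketch below, on synthetic data with three made-up features, runs the same three steps by hand and compares the R^2 with the pipeline's. With three input features, the default degree-2 expansion without a bias column yields nine columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term, so polynomial features help.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=100)

# The pipeline version, built from the same list of (name, estimator) tuples.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("polynomial", PolynomialFeatures(include_bias=False)),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
r2_pipe = pipe.score(X, y)

# The same steps done manually, one after another.
X_scaled = StandardScaler().fit_transform(X)
X_poly = PolynomialFeatures(include_bias=False).fit_transform(X_scaled)
r2_manual = LinearRegression().fit(X_poly, y).score(X_poly, y)

print(X_poly.shape, r2_pipe, r2_manual)
```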

#### Module 5: Model Evaluation and Refinement

Import the necessary modules:

In [23]:

```from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")```
`done`

We will split the data into training and testing sets:

In [24]:

```features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
X = df[features]
Y = df['price']

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)

print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])```
```number of test samples: 3242
number of training samples: 18371
```
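The two sample counts follow from the split fraction. The sketch below reproduces the arithmetic, assuming the total of 21613 rows implied by the output above and scikit-learn's behaviour of rounding the test share up to a whole sample.

```python
import math

n_samples = 21613   # 3242 + 18371, taken from the printed output
test_size = 0.15

# 0.15 * 21613 = 3241.95, which is rounded up for the test set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
```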

### Question 9

Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.

In [25]:

`from sklearn.linear_model import Ridge`

In [26]:

```RigeModel = Ridge(alpha=0.1)
RigeModel.fit(x_train, y_train)
RigeModel.score(x_test, y_test)```

Out[26]:

`0.6478759163939122`
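A single train/test split gives one R^2 estimate that depends on which rows land in the test set. The `cross_val_score` function imported earlier can average over several splits instead. The sketch below is a minimal illustration on synthetic data; the data, the coefficients and the choice of four folds are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem with four features and mild noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=300)

# For a regressor, the default score is R^2; cv=4 gives four held-out folds.
scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=4)

print(scores, scores.mean())
```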

### Question 10

Perform a second order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularisation parameter to 0.1, and calculate the R^2 utilising the test data provided. Take a screenshot of your code and the R^2.

In [27]:

```
pr = PolynomialFeatures(degree=2)
# Fit the transformer on the training data only, then reuse it on the test data
x_train_pr = pr.fit_transform(x_train[features])
x_test_pr = pr.transform(x_test[features])

RigeModel = Ridge(alpha=0.1)
RigeModel.fit(x_train_pr, y_train)
RigeModel.score(x_test_pr, y_test)
```

Out[27]:

`0.7002744279896707`
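One way to guarantee that the transform is fitted on the training data only is to wrap both steps in a pipeline. The sketch below does this on synthetic data with a quadratic target and compares the test R^2 against a plain Ridge model. The data is made up, and `make_pipeline` is a convenience that was not used in the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic target with squared and interaction terms.
rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
y = 1.5 * X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.2, size=400)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1
)

# Baseline: Ridge on the raw features.
linear_r2 = Ridge(alpha=0.1).fit(x_train, y_train).score(x_test, y_test)

# fit() fits the transform on x_train; score() only transforms x_test.
poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=0.1))
poly_ridge.fit(x_train, y_train)
poly_r2 = poly_ridge.score(x_test, y_test)

print(linear_r2, poly_r2)
```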

Once you complete your notebook, you will have to share it. Select the icon on the top right, marked in red in the image below. A dialogue box should open; select the option "all content excluding sensitive code cells".

You can then share the notebook via a URL by scrolling down as shown in the following image:

##### Conclusion:

I hope this Data Analysis with Python Peer Graded assignment Solution helps you learn something new from this course. If it helped you, don’t forget to bookmark our site for more Quiz Answers.


This course is intended for audiences of all experiences who are interested in learning about new skills in a business context; there are no prerequisite courses.

Keep Learning!