The map in last blog post is a little hard to make too much sense of since it largely just reflects population, so we revised it.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(sf)
## Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(tmap)
## Registered S3 methods overwritten by 'stars':
## method from
## st_bbox.SpatRaster sf
## st_crs.SpatRaster sf
library(formattable)
mortgage <-read_csv(here::here('dataset','dcr_clean.csv'))
## Rows: 619660 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state_orig_time, month, day
## dbl (29): id, time, orig_time, first_time, mat_time, res_time, balance_time,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mortgage_last<-mortgage %>% group_by(id) %>% summarise_all(last)
epsg_us_equal_area <- 2163
us_states <- st_read(here::here("dataset/cb_2019_us_state_20m/cb_2019_us_state_20m.shp"))%>%
st_transform(epsg_us_equal_area)
## Reading layer `cb_2019_us_state_20m' from data source
## `/Users/alex/Desktop/fa2021-final-project-digfora/dataset/cb_2019_us_state_20m/cb_2019_us_state_20m.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 52 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
## Geodetic CRS: NAD83
not_contiguous <-
c("Guam", "Commonwealth of the Northern Mariana Islands",
"American Samoa", "Puerto Rico", "Alaska", "Hawaii",
"United States Virgin Islands")
us_cont <- us_states %>%
filter(!NAME %in% not_contiguous) %>% filter(NAME != "District of Columbia") %>%
transmute(STUSPS,STATEFP,NAME, geometry)
p <- mortgage %>% group_by(state_orig_time) %>% summarize(count = n(),def = sum(status_time[status_time == 1])) %>% mutate(proportion = formattable::percent(def / count))
us_cont %>% inner_join(p, by = c("STUSPS" = "state_orig_time")) %>% tm_shape() + tm_polygons(col = "proportion",palette = "RdBu") + tm_text("STUSPS", size = 1/2) + tm_layout(title= 'percentage of number of default')

q <- mortgage %>% group_by(state_orig_time) %>% summarize(total = sum(balance_orig_time),def = sum(balance_orig_time[status_time == 1])) %>% mutate(proportion = formattable::percent(def / total))
us_cont %>% inner_join(q, by = c("STUSPS" = "state_orig_time")) %>% tm_shape() + tm_polygons(col = "proportion",palette = "RdBu") + tm_text("STUSPS", size = 1/2) + tm_layout(title= 'percentage of default amount')

For this week, we are starting a construct a model to predict default rate of certain mortgage. Since we have a panel dataset–multiple records refer to a same mortgage over different period. We would not know whether a mortgage is default or payoff until we see last record of this mortgage. Therefore, it is a little bit complicated for us to build a logistic linear regression. We then decided to normalized the dataset again and try to collapse them to build a dataset with have only one record for one mortgage, including their start time, end time, their age,avergae GDP during the mortgage, status at last(default or pay off) and etc.
Firstly, we would like to determine the start time, age, time to maturity of We start by loading the dataset and packages:
library(tidyverse)
mortgage <- read_csv(here::here('dataset','dcr_clean.csv'))
## Rows: 619660 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state_orig_time, month, day
## dbl (29): id, time, orig_time, first_time, mat_time, res_time, balance_time,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#create a new column for the age of a mortgage
mortgage$age<- mortgage$time-mortgage$orig_time
#create a new column for the time to maturity of a mortgage
mortgage$TTM<-mortgage$mat_time-mortgage$orig_time
#create a continuous time variable for each records
mortgage$date=paste(as.character(mortgage$year),as.character(mortgage$month),as.character(mortgage$day),sep='-')
#found out the first-time record for all mortgages
mortgage_first<-mortgage[match(unique(mortgage$id), mortgage$id),]
#create a column for start time of each mortgage
mortgage_first$first_time_date<-mortgage_first$date
mortgage_first$first_time<-mortgage_first$time
#leave only the collumn that contains first time data of the mortgage
first <- mortgage_first %>% select(id, first_time, orig_time,mat_time,rate_time,REtype_CO_orig_time,REtype_PU_orig_time,REtype_SF_orig_time,investor_orig_time,balance_orig_time:Interest_Rate_orig_time,state_orig_time,hpi_orig_time,first_time_date, age, TTM)
As followed, we would like to determine the ending status of each mortgage(defualt, pay off or unfnished):
#Get the last record for all mortgages in the dataset
get_last <- mortgage %>% select(id,date,time,default_time,payoff_time,status_time) %>% group_by(id) %>% summarise_all(last)
#determine the end time of a mortgage
get_last$last_time_date<-get_last$date
get_last$last_time<-get_last$time
#determine the status of the mortgage
get_last$status_last<-get_last$status_time
atlast <- get_last %>% select(id,last_time_date:status_last,default_time)
As followed, we would like to get the mean value of GDP growth, Unemployment Rate, Risk_free rate, interest rate, house price index for each mortgage from their start to their end.
meanvalue <- mortgage %>% group_by(id) %>% summarise(interest_rate_mean = mean(interest_rate_time),gdp_mean = mean(gdp_time), risk_free_mean = mean(rate_time),hpi_mean = mean(hpi_time),uer_mean = mean(uer_time))
Now We would like to join the three dataset together to form the new dataset with one record for one mortgage:
mortgage_all <- first %>% left_join(atlast, by = 'id') %>% left_join(meanvalue, by = 'id')
We would like to add one more variable, time between the start of the mortgage and global financial crisis(time 37) to measure the influence of Financial Crisis to the mortgage
mortgage_all$time_to_GFC <- 37-mortgage_all$first_time
Here is the column expalnation of our new datasets.
newstr<-'id:borrower id
first_time:time stamp for first observation
orig_time:time stamp for origination
mat_time:time stamp for maturity
rate_time:risk-free rate
REtype_CO_orig_time:real estate type condominium: 1, otherwise: 0
REtype_PU_orig_time:real estate type planned urban developments: 1, otherwise: 0
REtype_SF_orig_time:single family home: 1, otherwise: 0
investor_orig_time:investor borrower: 1, otherwise: 0
balance_orig_time:outstanding balance at origination time
FICO_orig_time:FICO score at origination time, in %
LTV_orig_time:loan to value ratio at origination time, in %
Interest_Rate_orig_time:interest rate at origination time, in %
state_orig_time:US state in which the property is located
hpi_orig_time:house price index at observation time, base year=100
first_time_date:date of the first observation
age:number of periods from origination to time of observation
TTM:time to maturity at the time of observation
last_time_date:date of the last observation
last_time:time stamp for the last observation
status_last:debt status at the last observation: 0 is unfinished, 1 is default, 2 is paidoff
interest_rate_mean:the average interest rate of all observation periods
gdp_mean:the average GDP growth of all observation periods
risk_free_mean:the average risk free rate of all observation periods
hpi_mean:the average house price index of all observation periods
uer_mean:the average unemployment rate of all observation periods
time_to_GFC:number of periods from first observation to global financial crisis'
lista<-str_split(newstr, "\n")
keydataname<-data.frame(lista)
colnames(keydataname)<-'dataname'
keydataname<-keydataname %>% separate(dataname,into=c('Column Name','Description'),sep=':')
## Warning: Expected 2 pieces. Additional pieces discarded in 5 rows [6, 7, 8, 9,
## 21].
as.tibble(keydataname) %>%knitr::kable()
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
| Column Name | Description |
|---|---|
| id | borrower id |
| first_time | time stamp for first observation |
| orig_time | time stamp for origination |
| mat_time | time stamp for maturity |
| rate_time | risk-free rate |
| REtype_CO_orig_time | real estate type condominium |
| REtype_PU_orig_time | real estate type planned urban developments |
| REtype_SF_orig_time | single family home |
| investor_orig_time | investor borrower |
| balance_orig_time | outstanding balance at origination time |
| FICO_orig_time | FICO score at origination time, in % |
| LTV_orig_time | loan to value ratio at origination time, in % |
| Interest_Rate_orig_time | interest rate at origination time, in % |
| state_orig_time | US state in which the property is located |
| hpi_orig_time | house price index at observation time, base year=100 |
| first_time_date | date of the first observation |
| age | number of periods from origination to time of observation |
| TTM | time to maturity at the time of observation |
| last_time_date | date of the last observation |
| last_time | time stamp for the last observation |
| status_last | debt status at the last observation |
| interest_rate_mean | the average interest rate of all observation periods |
| gdp_mean | the average GDP growth of all observation periods |
| risk_free_mean | the average risk free rate of all observation periods |
| hpi_mean | the average house price index of all observation periods |
| uer_mean | the average unemployment rate of all observation periods |
| time_to_GFC | number of periods from first observation to global financial crisis |
Now, we will need to filter out mortgages that are unfinished.
mortgage_all <- mortgage_all %>% filter(status_last != 0)
Some of the columns that are not included will be discussed in the future.Now, with the new collapsed dataset, we can start constructing our model. Firstly, we would like to construct a correlation table to determine which varible to be included in our regression fuction.
variable_subset <- mortgage_all[c('first_time','orig_time','mat_time','rate_time','REtype_CO_orig_time','REtype_PU_orig_time','REtype_SF_orig_time','investor_orig_time','balance_orig_time','FICO_orig_time','LTV_orig_time','Interest_Rate_orig_time','hpi_orig_time','age','TTM','last_time','default_time','interest_rate_mean','gdp_mean','risk_free_mean','hpi_mean','uer_mean','time_to_GFC')]
cor(variable_subset) %>% knitr::kable()
| first_time | orig_time | mat_time | rate_time | REtype_CO_orig_time | REtype_PU_orig_time | REtype_SF_orig_time | investor_orig_time | balance_orig_time | FICO_orig_time | LTV_orig_time | Interest_Rate_orig_time | hpi_orig_time | age | TTM | last_time | default_time | interest_rate_mean | gdp_mean | risk_free_mean | hpi_mean | uer_mean | time_to_GFC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| first_time | 1.0000000 | 0.7204394 | 0.4121585 | 0.1138200 | 0.0292483 | 0.0229569 | -0.0307036 | 0.0116296 | 0.1546413 | 0.0824204 | 0.0092745 | 0.0687341 | 0.6929031 | 0.2299836 | 0.1156180 | 0.6148660 | 0.3039482 | -0.1023105 | -0.5424740 | 0.1136880 | 0.1369459 | 0.2936127 | -1.0000000 |
| orig_time | 0.7204394 | 1.0000000 | 0.5918757 | -0.0615906 | 0.0407945 | 0.0610640 | -0.0599190 | 0.0432014 | 0.2098406 | 0.1553459 | -0.0164785 | 0.0160576 | 0.9160574 | -0.5092385 | 0.1846039 | 0.4593792 | 0.2978776 | -0.1435913 | -0.4828428 | -0.0616145 | 0.1433444 | 0.1679640 | -0.7204394 |
| mat_time | 0.4121585 | 0.5918757 | 1.0000000 | -0.0149594 | 0.0350942 | 0.0605812 | -0.0418815 | -0.0198346 | 0.1676437 | 0.0631609 | 0.0632089 | -0.1081868 | 0.5743841 | -0.3190921 | 0.9014386 | 0.2064687 | 0.2295032 | -0.1050570 | -0.2702708 | -0.0153298 | 0.1000367 | 0.0480902 | -0.4121585 |
| rate_time | 0.1138200 | -0.0615906 | -0.0149594 | 1.0000000 | 0.0064903 | -0.0117604 | 0.0253148 | 0.0114580 | -0.0313978 | -0.0780419 | 0.0118861 | 0.1827828 | 0.1063109 | 0.2276740 | 0.0148391 | 0.0809171 | 0.1885953 | 0.2813221 | -0.3139753 | 0.9996952 | -0.1837950 | 0.0154899 | -0.1138200 |
| REtype_CO_orig_time | 0.0292483 | 0.0407945 | 0.0350942 | 0.0064903 | 1.0000000 | -0.0965647 | -0.3415917 | 0.0207958 | -0.0089662 | 0.0915365 | 0.0098910 | -0.0238453 | 0.0521295 | -0.0209500 | 0.0208812 | 0.0322690 | 0.0054182 | -0.0650387 | -0.0295951 | 0.0063323 | 0.0047353 | 0.0114768 | -0.0292483 |
| REtype_PU_orig_time | 0.0229569 | 0.0610640 | 0.0605812 | -0.0117604 | -0.0965647 | 1.0000000 | -0.4690949 | -0.0166762 | 0.0958164 | 0.1242780 | -0.0048864 | -0.0248709 | 0.0658115 | -0.0572009 | 0.0410717 | 0.0454316 | 0.0126085 | -0.1162657 | -0.0339186 | -0.0115687 | 0.0031630 | 0.0253864 | -0.0229569 |
| REtype_SF_orig_time | -0.0307036 | -0.0599190 | -0.0418815 | 0.0253148 | -0.3415917 | -0.4690949 | 1.0000000 | -0.1013132 | -0.0462944 | -0.1319948 | 0.0041750 | -0.0137843 | -0.0682895 | 0.0459809 | -0.0188856 | -0.0405454 | -0.0039875 | 0.0730690 | 0.0052115 | 0.0252423 | -0.0541469 | -0.0119068 | 0.0307036 |
| investor_orig_time | 0.0116296 | 0.0432014 | -0.0198346 | 0.0114580 | 0.0207958 | -0.0166762 | -0.1013132 | 1.0000000 | -0.1168575 | 0.1699794 | -0.0639457 | 0.0956060 | 0.0435371 | -0.0461916 | -0.0473878 | 0.0412416 | 0.0339573 | 0.0340316 | -0.0332389 | 0.0114132 | -0.0052367 | 0.0242550 | -0.0116296 |
| balance_orig_time | 0.1546413 | 0.2098406 | 0.1676437 | -0.0313978 | -0.0089662 | 0.0958164 | -0.0462944 | -0.1168575 | 1.0000000 | 0.3229090 | -0.1935153 | -0.0835764 | 0.1884202 | -0.1025605 | 0.0917100 | 0.1515442 | -0.0010766 | -0.3375744 | -0.1397382 | -0.0314263 | -0.0176242 | 0.1100034 | -0.1546413 |
| FICO_orig_time | 0.0824204 | 0.1553459 | 0.0631609 | -0.0780419 | 0.0915365 | 0.1242780 | -0.1319948 | 0.1699794 | 0.3229090 | 1.0000000 | -0.0995956 | -0.0710463 | 0.1558017 | -0.1157125 | -0.0064203 | 0.2040493 | -0.1082498 | -0.5259041 | -0.1282160 | -0.0777834 | -0.0048778 | 0.1430426 | -0.0824204 |
| LTV_orig_time | 0.0092745 | -0.0164785 | 0.0632089 | 0.0118861 | 0.0098910 | -0.0048864 | 0.0041750 | -0.0639457 | -0.1935153 | -0.0995956 | 1.0000000 | -0.0021603 | -0.0174270 | 0.0346331 | 0.0859226 | -0.0590717 | 0.0888124 | 0.2030012 | 0.0499672 | 0.0119659 | 0.0528514 | -0.0554563 | -0.0092745 |
| Interest_Rate_orig_time | 0.0687341 | 0.0160576 | -0.1081868 | 0.1827828 | -0.0238453 | -0.0248709 | -0.0137843 | 0.0956060 | -0.0835764 | -0.0710463 | -0.0021603 | 1.0000000 | 0.0467439 | 0.0627628 | -0.1405393 | 0.0873160 | 0.1052690 | 0.2503181 | -0.1808460 | 0.1828129 | -0.1402862 | 0.0448506 | -0.0687341 |
| hpi_orig_time | 0.6929031 | 0.9160574 | 0.5743841 | 0.1063109 | 0.0521295 | 0.0658115 | -0.0682895 | 0.0435371 | 0.1884202 | 0.1558017 | -0.0174270 | 0.0467439 | 1.0000000 | -0.4256156 | 0.2083603 | 0.4535288 | 0.3074560 | -0.1397604 | -0.4880450 | 0.1062585 | 0.2426822 | 0.0863770 | -0.6929031 |
| age | 0.2299836 | -0.5092385 | -0.3190921 | 0.2276740 | -0.0209500 | -0.0572009 | 0.0459809 | -0.0461916 | -0.1025605 | -0.1157125 | 0.0346331 | 0.0627628 | -0.4256156 | 1.0000000 | -0.1155728 | 0.1183877 | -0.0408169 | 0.0745348 | 0.0043735 | 0.2275439 | -0.0312074 | 0.1286613 | -0.2299836 |
| TTM | 0.1156180 | 0.1846039 | 0.9014386 | 0.0148391 | 0.0208812 | 0.0410717 | -0.0188856 | -0.0473878 | 0.0917100 | -0.0064203 | 0.0859226 | -0.1405393 | 0.2083603 | -0.1155728 | 1.0000000 | 0.0050268 | 0.1198535 | -0.0509779 | -0.0702203 | 0.0144003 | 0.0449892 | -0.0315734 | -0.1156180 |
| last_time | 0.6148660 | 0.4593792 | 0.2064687 | 0.0809171 | 0.0322690 | 0.0454316 | -0.0405454 | 0.0412416 | 0.1515442 | 0.2040493 | -0.0590717 | 0.0873160 | 0.4535288 | 0.1183877 | 0.0050268 | 1.0000000 | 0.2975575 | -0.2200291 | -0.6949821 | 0.0814640 | -0.2450211 | 0.7209687 | -0.6148660 |
| default_time | 0.3039482 | 0.2978776 | 0.2295032 | 0.1885953 | 0.0054182 | 0.0126085 | -0.0039875 | 0.0339573 | -0.0010766 | -0.1082498 | 0.0888124 | 0.1052690 | 0.3074560 | -0.0408169 | 0.1198535 | 0.2975575 | 1.0000000 | 0.1516563 | -0.4493496 | 0.1884337 | -0.1671072 | 0.1808916 | -0.3039482 |
| interest_rate_mean | -0.1023105 | -0.1435913 | -0.1050570 | 0.2813221 | -0.0650387 | -0.1162657 | 0.0730690 | 0.0340316 | -0.3375744 | -0.5259041 | 0.2030012 | 0.2503181 | -0.1397604 | 0.0745348 | -0.0509779 | -0.2200291 | 0.1516563 | 1.0000000 | -0.0229672 | 0.2810567 | -0.1204072 | -0.1729804 | 0.1023105 |
| gdp_mean | -0.5424740 | -0.4828428 | -0.2702708 | -0.3139753 | -0.0295951 | -0.0339186 | 0.0052115 | -0.0332389 | -0.1397382 | -0.1282160 | 0.0499672 | -0.1808460 | -0.4880450 | 0.0043735 | -0.0702203 | -0.6949821 | -0.4493496 | -0.0229672 | 1.0000000 | -0.3140921 | 0.4424599 | -0.5565366 | 0.5424740 |
| risk_free_mean | 0.1136880 | -0.0616145 | -0.0153298 | 0.9996952 | 0.0063323 | -0.0115687 | 0.0252423 | 0.0114132 | -0.0314263 | -0.0777834 | 0.0119659 | 0.1828129 | 0.1062585 | 0.2275439 | 0.0144003 | 0.0814640 | 0.1884337 | 0.2810567 | -0.3140921 | 1.0000000 | -0.1840000 | 0.0159213 | -0.1136880 |
| hpi_mean | 0.1369459 | 0.1433444 | 0.1000367 | -0.1837950 | 0.0047353 | 0.0031630 | -0.0541469 | -0.0052367 | -0.0176242 | -0.0048778 | 0.0528514 | -0.1402862 | 0.2426822 | -0.0312074 | 0.0449892 | -0.2450211 | -0.1671072 | -0.1204072 | 0.4424599 | -0.1840000 | 1.0000000 | -0.6671533 | -0.1369459 |
| uer_mean | 0.2936127 | 0.1679640 | 0.0480902 | 0.0154899 | 0.0114768 | 0.0253864 | -0.0119068 | 0.0242550 | 0.1100034 | 0.1430426 | -0.0554563 | 0.0448506 | 0.0863770 | 0.1286613 | -0.0315734 | 0.7209687 | 0.1808916 | -0.1729804 | -0.5565366 | 0.0159213 | -0.6671533 | 1.0000000 | -0.2936127 |
| time_to_GFC | -1.0000000 | -0.7204394 | -0.4121585 | -0.1138200 | -0.0292483 | -0.0229569 | 0.0307036 | -0.0116296 | -0.1546413 | -0.0824204 | -0.0092745 | -0.0687341 | -0.6929031 | -0.2299836 | -0.1156180 | -0.6148660 | -0.3039482 | 0.1023105 | 0.5424740 | -0.1136880 | -0.1369459 | -0.2936127 | 1.0000000 |
Now I will make run a regression between default_rate and each variable and plot them with ggplot. For values like GDP,House Price Index, Unemployment Rate with both mean and first-time value, I will only focus on the one with stronger correlation to default rate.I will start by the one with the strongest to the weakest.
cor(variable_subset,variable_subset$default_time)
## [,1]
## first_time 0.303948205
## orig_time 0.297877579
## mat_time 0.229503177
## rate_time 0.188595324
## REtype_CO_orig_time 0.005418208
## REtype_PU_orig_time 0.012608515
## REtype_SF_orig_time -0.003987520
## investor_orig_time 0.033957315
## balance_orig_time -0.001076587
## FICO_orig_time -0.108249816
## LTV_orig_time 0.088812382
## Interest_Rate_orig_time 0.105268981
## hpi_orig_time 0.307455995
## age -0.040816929
## TTM 0.119853510
## last_time 0.297557465
## default_time 1.000000000
## interest_rate_mean 0.151656297
## gdp_mean -0.449349609
## risk_free_mean 0.188433686
## hpi_mean -0.167107236
## uer_mean 0.180891564
## time_to_GFC -0.303948205
mean GDP:
library(modelr)
mor_mod <- glm(default_time ~ gdp_mean, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod))
## (Intercept) gdp_mean
## 1.0324845 -0.8995499
library(modelr)
(grid <- mortgage_all %>% data_grid(gdp_mean) %>%
add_predictions(mor_mod, type = "response"))
## # A tibble: 1,447 x 2
## gdp_mean pred
## <dbl> <dbl>
## 1 -3.83 0.989
## 2 -3.74 0.988
## 3 -3.67 0.987
## 4 -3.52 0.985
## 5 -3.16 0.980
## 6 -2.82 0.973
## 7 -2.81 0.972
## 8 -2.70 0.969
## 9 -2.58 0.966
## 10 -2.39 0.960
## # … with 1,437 more rows
ggplot(mortgage_all, aes(gdp_mean)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

pattern: As gdp mean increases roughly from 3.75 to 5.0, the default rate shows a decrease pattern from 1 to 0. The dots are mainly gathered from 0 to 2.5 gdp mean for both default and pay-off.
Original House Price Index:
mor_mod1 <- glm(default_time ~ hpi_orig_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod1))
## (Intercept) hpi_orig_time
## -5.41944535 0.02425822
library(modelr)
(grid <- mortgage_all %>% data_grid(hpi_orig_time) %>%
add_predictions(mor_mod1, type = "response"))
## # A tibble: 87 x 2
## hpi_orig_time pred
## <dbl> <dbl>
## 1 75.7 0.0270
## 2 75.7 0.0271
## 3 75.9 0.0272
## 4 76.4 0.0275
## 5 76.5 0.0275
## 6 76.5 0.0275
## 7 76.7 0.0277
## 8 76.9 0.0278
## 9 77.0 0.0279
## 10 77 0.0279
## # … with 77 more rows
ggplot(mortgage_all, aes(hpi_orig_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

pattern: There is a weak correlation between the original house price index (base year = 100) and default rate. As the original house price index increases from 75 to 200, the default rate increases at the same time from 0 to 0.5.
For time varible, I will keep only origin time of the dataset:
mor_mod2 <- glm(default_time ~ orig_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod2))
## (Intercept) orig_time
## -3.1697610 0.1243071
library(modelr)
(grid <- mortgage_all %>% data_grid(orig_time) %>%
add_predictions(mor_mod2, type = "response"))
## # A tibble: 88 x 2
## orig_time pred
## <dbl> <dbl>
## 1 -40 0.000291
## 2 -39 0.000329
## 3 -37 0.000422
## 4 -36 0.000478
## 5 -33 0.000694
## 6 -30 0.00101
## 7 -29 0.00114
## 8 -28 0.00129
## 9 -26 0.00166
## 10 -25 0.00187
## # … with 78 more rows
ggplot(mortgage_all, aes(orig_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

pattern: As we can see, the default rate tends to be 0 when time stamp for start of the mortgage from -40 to 0. The default rate increases dramatically from 0 to 40, and remains at 1 from 40 to 60.
First time Risk-free rate:
mor_mod4 <- glm(default_time ~ rate_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod4))
## (Intercept) rate_time
## -4.2555650 0.8188854
library(modelr)
(grid <- mortgage_all %>% data_grid(rate_time) %>%
add_predictions(mor_mod4, type = "response"))
## # A tibble: 81 x 2
## rate_time pred
## <dbl> <dbl>
## 1 1.65 0.0519
## 2 1.67 0.0527
## 3 1.78 0.0574
## 4 1.87 0.0616
## 5 1.89 0.0625
## 6 1.92 0.0640
## 7 1.94 0.0650
## 8 2.17 0.0774
## 9 2.23 0.0810
## 10 2.52 0.100
## # … with 71 more rows
ggplot(mortgage_all, aes(rate_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: There is a strong correlation between these two variables. As first time risk-free rate increases, the default rate increases dramatically from 0 to 1.
Unemployment Rate:
mor_mod5 <- glm(default_time ~ uer_mean, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod5))
## (Intercept) uer_mean
## -2.5602844 0.3632487
library(modelr)
(grid <- mortgage_all %>% data_grid(uer_mean) %>%
add_predictions(mor_mod5, type = "response"))
## # A tibble: 1,113 x 2
## uer_mean pred
## <dbl> <dbl>
## 1 3.8 0.235
## 2 3.9 0.242
## 3 3.95 0.245
## 4 3.98 0.247
## 5 4 0.248
## 6 4.03 0.251
## 7 4.05 0.252
## 8 4.06 0.252
## 9 4.12 0.257
## 10 4.15 0.259
## # … with 1,103 more rows
ggplot(mortgage_all, aes(uer_mean)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: There is a positive linear relationship between the average unemployment rate and default rate. As the average unemployment rate increases from roughly 4 to 10, the default rate increases from 0.25 to 0.75.
Mean Interest Rate:
mor_mod6 <- glm(default_time ~ interest_rate_mean, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod6))
## (Intercept) interest_rate_mean
## -1.9736370 0.1906897
library(modelr)
(grid <- mortgage_all %>% data_grid(interest_rate_mean) %>%
add_predictions(mor_mod6, type = "response"))
## # A tibble: 7,527 x 2
## interest_rate_mean pred
## <dbl> <dbl>
## 1 1.12 0.147
## 2 1.72 0.162
## 3 1.75 0.162
## 4 1.99 0.169
## 5 1.99 0.169
## 6 2.00 0.169
## 7 2 0.169
## 8 2.00 0.169
## 9 2.01 0.169
## 10 2.02 0.170
## # … with 7,517 more rows
ggplot(mortgage_all, aes(interest_rate_mean)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: The default rate increases dramatically as the average interest rate of all observation periods from roughly 2 to 20, and default rate tends to be flat from 20 to roughly 37.
Time to Maturity:
mor_mod7 <- glm(default_time ~ TTM, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod7))
## (Intercept) TTM
## -3.22550171 0.02253902
library(modelr)
(grid <- mortgage_all %>% data_grid(TTM) %>%
add_predictions(mor_mod7, type = "response"))
## # A tibble: 146 x 2
## TTM pred
## <dbl> <dbl>
## 1 5 0.0426
## 2 9 0.0464
## 3 10 0.0474
## 4 11 0.0484
## 5 13 0.0506
## 6 14 0.0517
## 7 15 0.0528
## 8 16 0.0539
## 9 17 0.0551
## 10 18 0.0563
## # … with 136 more rows
ggplot(mortgage_all, aes(TTM)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: There is a positive non-linear relationship between scheduled time to maturity and default rate. The slope increases slowly at the beginning, and start to increase relative deeply at roughly 100.
FICO score(credit score):
mor_mod8 <- glm(default_time ~ FICO_orig_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod8))
## (Intercept) FICO_orig_time
## 1.497799187 -0.003128735
library(modelr)
(grid <- mortgage_all %>% data_grid(FICO_orig_time) %>%
add_predictions(mor_mod8, type = "response"))
## # A tibble: 389 x 2
## FICO_orig_time pred
## <dbl> <dbl>
## 1 400 0.561
## 2 406 0.557
## 3 413 0.551
## 4 420 0.546
## 5 428 0.540
## 6 429 0.539
## 7 433 0.536
## 8 439 0.531
## 9 441 0.529
## 10 442 0.529
## # … with 379 more rows
ggplot(mortgage_all, aes(FICO_orig_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

pattern: The correlation between default rate and FICO score(credit score) at origination time is negative and weak, which indicates a small correlation between these two variables.
Loan to Value ratio:
mor_mod9 <- glm(default_time ~ LTV_orig_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod9))
## (Intercept) LTV_orig_time
## -2.09611716 0.01917789
library(modelr)
(grid <- mortgage_all %>% data_grid(LTV_orig_time) %>%
add_predictions(mor_mod9, type = "response"))
## # A tibble: 538 x 2
## LTV_orig_time pred
## <dbl> <dbl>
## 1 50.1 0.243
## 2 50.2 0.244
## 3 50.3 0.244
## 4 50.4 0.244
## 5 50.5 0.245
## 6 50.6 0.245
## 7 50.7 0.245
## 8 50.8 0.246
## 9 50.9 0.246
## 10 51 0.246
## # … with 528 more rows
ggplot(mortgage_all, aes(LTV_orig_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: The correlation here looks pretty steady, except a little twit around LTV_orig_time = 175. This indicates an overall stable positive linear correlation between default time and loan-to-value ratio.
age of a mortgage:
mor_mod10 <- glm(default_time ~ age, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod10))
## (Intercept) age
## -0.49817484 -0.01776127
library(modelr)
(grid <- mortgage_all %>% data_grid(age) %>%
add_predictions(mor_mod10, type = "response"))
## # A tibble: 66 x 2
## age pred
## <dbl> <dbl>
## 1 0 0.378
## 2 1 0.374
## 3 2 0.370
## 4 3 0.366
## 5 4 0.361
## 6 5 0.357
## 7 6 0.353
## 8 7 0.349
## 9 8 0.345
## 10 9 0.341
## # … with 56 more rows
ggplot(mortgage_all, aes(age)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

pattern: The correlation here looks negatively linear but not strong, since the correlation line is relatively flat, which indicates a small tangent and small correlation between these two variables.
Investor Type:
ggplot(data=mortgage_all, mapping=aes(x= investor_orig_time,y=REtype_SF_orig_time))+
geom_bar(stat="identity")

Pattern:
Value of the house:
mor_mod12 <- glm(default_time ~ balance_orig_time, data = mortgage_all, family = binomial)
(beta <- coef(mor_mod12))
## (Intercept) balance_orig_time
## -5.567810e-01 -1.075221e-08
library(modelr)
(grid <- mortgage_all %>% data_grid(balance_orig_time) %>%
add_predictions(mor_mod12, type = "response"))
## # A tibble: 9,409 x 2
## balance_orig_time pred
## <dbl> <dbl>
## 1 0 0.364
## 2 6246 0.364
## 3 9575 0.364
## 4 10000 0.364
## 5 10465. 0.364
## 6 11100 0.364
## 7 11400 0.364
## 8 11845. 0.364
## 9 12000 0.364
## 10 12008 0.364
## # … with 9,399 more rows
ggplot(mortgage_all, aes(balance_orig_time)) + geom_point(aes(y = default_time),alpha = 0.005) +
geom_line(aes(y = pred), data = grid, color = "red", size = 1)

Pattern: There is almost no correlation between value of the house and default rate. The correlation line is totally flat, which means value of the house does not affect default time.
REtype_CO_orig_tim:condominium:
ggplot(data=mortgage_all, mapping=aes(x=default_time,y=REtype_CO_orig_time))+
geom_bar(stat="identity")

Pattern: The ratio of pay-off to default in condominium is around 1:2, with pay-off approximately 1000 and default approximately 2000. This type of real estate has the largest difference in ratio between payoff and default.
REtype_PU_orig_time real estate type planned urban developments:
ggplot(data=mortgage_all, mapping=aes(x=default_time,y=REtype_PU_orig_time))+
geom_bar(stat="identity")

Pattern: The ratio of pay-off to default in planned urban developments is around 3:5, with pay-off approximately 1800 and default approximately 3000.
REtype_SF_orig_time single family home:
ggplot(data=mortgage_all, mapping=aes(x=default_time,y=REtype_SF_orig_time))+
geom_bar(stat="identity")

Pattern: The ratio of pay-off to default in single family home is around 19:33, with pay-off approximately 9500 and default approximately 16500. This type of real estate has the most counts. So, after ploting all these varible, we are now able to complete our full regression model:
library(modelr)
glm(default_time ~ age + first_time+REtype_CO_orig_time+REtype_PU_orig_time+REtype_SF_orig_time+
interest_rate_mean+gdp_mean+rate_time+hpi_orig_time+uer_mean+investor_orig_time+balance_orig_time+FICO_orig_time+LTV_orig_time+TTM- 1 , data = mortgage_all, family = binomial) %>% broom::tidy()
## # A tibble: 15 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age -1.91e-2 0.00502 -3.81 1.38e- 4
## 2 first_time 2.27e-2 0.00489 4.65 3.35e- 6
## 3 REtype_CO_orig_time -1.05e-2 0.0540 -0.194 8.46e- 1
## 4 REtype_PU_orig_time 7.95e-2 0.0446 1.78 7.48e- 2
## 5 REtype_SF_orig_time -8.40e-2 0.0315 -2.67 7.62e- 3
## 6 interest_rate_mean 7.94e-2 0.00858 9.25 2.29e- 20
## 7 gdp_mean -9.64e-1 0.0145 -66.3 0
## 8 rate_time 8.65e-2 0.0287 3.01 2.62e- 3
## 9 hpi_orig_time 5.48e-3 0.000974 5.62 1.89e- 8
## 10 uer_mean -1.44e-1 0.0128 -11.2 3.14e- 29
## 11 investor_orig_time 4.06e-1 0.0391 10.4 3.20e- 25
## 12 balance_orig_time -5.74e-9 0.0000000641 -0.0895 9.29e- 1
## 13 FICO_orig_time -6.32e-3 0.000185 -34.2 2.96e-256
## 14 LTV_orig_time 2.34e-2 0.00130 18.1 3.16e- 73
## 15 TTM 1.39e-2 0.000991 14.0 1.22e- 44
As we can see from the chart, condominium or not, planned urban developments or not, the value of the house, is not statistically significant. Therefore, we will kick them out od our model.
library(modelr)
glm(default_time ~ age + first_time+REtype_SF_orig_time+interest_rate_mean+gdp_mean+rate_time+hpi_orig_time+uer_mean+investor_orig_time+FICO_orig_time+LTV_orig_time+TTM- 1 , data = mortgage_all, family = binomial) %>% broom::tidy()
## # A tibble: 12 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age -0.0192 0.00500 -3.85 1.18e- 4
## 2 first_time 0.0226 0.00486 4.65 3.27e- 6
## 3 REtype_SF_orig_time -0.107 0.0249 -4.29 1.82e- 5
## 4 interest_rate_mean 0.0786 0.00846 9.29 1.58e- 20
## 5 gdp_mean -0.964 0.0145 -66.3 0
## 6 rate_time 0.0880 0.0287 3.07 2.13e- 3
## 7 hpi_orig_time 0.00549 0.000973 5.64 1.68e- 8
## 8 uer_mean -0.144 0.0128 -11.2 3.52e- 29
## 9 investor_orig_time 0.401 0.0383 10.5 1.34e- 25
## 10 FICO_orig_time -0.00631 0.000179 -35.3 7.50e-273
## 11 LTV_orig_time 0.0235 0.00128 18.4 1.39e- 75
## 12 TTM 0.0140 0.000987 14.1 2.31e- 45
Generally speaking, GDP growth, uemployment rate,and the mean interest rate is still the outside dominant factor that may influence one’s default rate. In addition to that, investor house or not, loan to house value ratio, time to maturity are also the inside factor that determines one’s default rate.Therefore, whether one will default or not in a mortgage is strongly related to both the economy of the country and the types of specific mortgage.