Final Decision of Our Data

2021-10-25

Mortgage Market analysis in the US

This dataset is from http://www.deepcreditrisk.com/data--code-private.html

Description:

The data set is in panel form and reports origination and performance observations for 50000 residential U.S. mortgage borrowers over 60 peiords. The periods have been deidentified. As in the real world, loans may originate before the start of the observation period. The loan observations may thus be censored as the loans mature or borrowers refinance. The data set is a randomized selection of mortgage-loan-level data collected from the portfolios underlying U.S. residential mortgage-backed securities (RMBS) securitization portfolios. Such randomization gives us a fair representation of US mortgage market development. We would like to combine the dataset with other market indicator to further measure the probability of default for personal investors.

Key variables include:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
str<-'id: borrower id
time: time stamp of observation
orig_time: time stamp for origination
first_time: time stamp for first observation
mat_time: time stamp for maturity
res_time: time stamp for resolution
balance_time: outstanding balance at observation time
LTV_time: loan to value ratio at observation time, in %
interest_rate_time: interest rate at observation time, in %
rate_time: risk-free rate
hpi_time: house price index at observation time, base year=100
gdp_time: GDP growth at observation time, in %
uer_time: unemployment rate at observation time, in %
REtype_CO_orig_time: real estate type condominium: 1, otherwise: 0
REtype_PU_orig_time: real estate type planned urban developments: 1, otherwise: 0
REtype_SF_orig_time: single family home: 1, otherwise: 0
investor_orig_time: investor borrower: 1, otherwise: 0
balance_orig_time: outstanding balance at origination time
FICO_orig_time: FICO score at origination time, in %
LTV_orig_time: loan to value ratio at origination time, in %
Interest_Rate_orig_time: interest rate at origination time, in %
state_orig_time: US state in which the property is located
hpi_orig_time: house price index at observation time, base year=100
default_time: default observation at observation time
payoff_time: payoff observation at observation time
status_time: default (1), payoff (2) and non-default/non-payoff (0) observation at observation time
lgd_time: LGD assuming no discounting of cash flows
recovery_res: sum of all cash flows received during resolution period'

lista<-str_split(str, "\n")
keydataname<-data.frame(lista)
colnames(keydataname)<-'dataname'
keydataname<-keydataname %>% separate(dataname,into=c('Column Name','Description'),sep=':')
## Warning: Expected 2 pieces. Additional pieces discarded in 4 rows [14, 15, 16,
## 17].
print(as.tibble(keydataname), n = 28)
## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## # A tibble: 28 x 2
##    `Column Name`        Description                                             
##    <chr>                <chr>                                                   
##  1 id                   " borrower id"                                          
##  2 time                 " time stamp of observation"                            
##  3 orig_time            " time stamp for origination"                           
##  4 first_time           " time stamp for first observation"                     
##  5 mat_time             " time stamp for maturity"                              
##  6 res_time             " time stamp for resolution"                            
##  7 balance_time         " outstanding balance at observation time"              
##  8 LTV_time             " loan to value ratio at observation time, in %"        
##  9 interest_rate_time   " interest rate at observation time, in %"              
## 10 rate_time            " risk-free rate"                                       
## 11 hpi_time             " house price index at observation time, base year=100" 
## 12 gdp_time             " GDP growth at observation time, in %"                 
## 13 uer_time             " unemployment rate at observation time, in %"          
## 14 REtype_CO_orig_time  " real estate type condominium"                         
## 15 REtype_PU_orig_time  " real estate type planned urban developments"          
## 16 REtype_SF_orig_time  " single family home"                                   
## 17 investor_orig_time   " investor borrower"                                    
## 18 balance_orig_time    " outstanding balance at origination time"              
## 19 FICO_orig_time       " FICO score at origination time, in %"                 
## 20 LTV_orig_time        " loan to value ratio at origination time, in %"        
## 21 Interest_Rate_orig_… " interest rate at origination time, in %"              
## 22 state_orig_time      " US state in which the property is located"            
## 23 hpi_orig_time        " house price index at observation time, base year=100" 
## 24 default_time         " default observation at observation time"              
## 25 payoff_time          " payoff observation at observation time"               
## 26 status_time          " default (1), payoff (2) and non-default/non-payoff (0…
## 27 lgd_time             " LGD assuming no discounting of cash flows"            
## 28 recovery_res         " sum of all cash flows received during resolution peri…

The blog posts together should address the following topics.

Data Loading and Cleaning

library(tidyverse)
mortgage <-read_csv("dcr_full.csv", col_types=cols(
  id = col_double(),
  time = col_double(),
  orig_time = col_double(),
  first_time = col_double(),
  mat_time = col_double(),
  res_time = col_double(),
  balance_time = col_double(),
  LTV_time = col_double(),
  interest_rate_time = col_double(),
  rate_time = col_double(),
  hpi_time = col_double(),
  gdp_time = col_double(),
  uer_time = col_double(),
  REtype_CO_orig_time = col_double(),
  iREtype_PU_orig_time = col_double(),
  REtype_SF_orig_time = col_double(),
  investor_orig_time = col_double(),
  balance_orig_time = col_double(),
  FICO_orig_time = col_double(),
  LTV_orig_time = col_double(),
  Interest_Rate_orig_time = col_double(),
  state_orig_time = col_character(),
  hpi_orig_time = col_double(),
  default_time = col_double(),
  payoff_time = col_double(),
  status_time = col_character()
  ))
## Warning: The following named parsers don't match the column names:
## iREtype_PU_orig_time
mortgage
## # A tibble: 622,489 x 28
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    25        -7         25      113       NA       41303.     24.5
##  2     1    26        -7         25      113       NA       41062.     24.5
##  3     1    27        -7         25      113       NA       40804.     24.6
##  4     1    28        -7         25      113       NA       40484.     24.7
##  5     1    29        -7         25      113       NA       40367.     24.9
##  6     1    30        -7         25      113       NA       40128.     25.3
##  7     1    31        -7         25      113       NA       39719.     26.6
##  8     1    32        -7         25      113       NA       35877.     25.9
##  9     1    33        -7         25      113       NA       34410.     25.6
## 10     1    34        -7         25      113       NA       33590.     26.0
## # … with 622,479 more rows, and 20 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
problems()

If you are working with a large data set, you might decide to start with a subset of the data. How did you choose this?

I use a random number generator because I do not want my subset to be biased and stratified.

Discuss any initial steps you are taking to load and clean the data.

Our Idea is to make a logistics regression to see any variable strongly influence people’s choice between payoff and default, so the first step for us to clean the data is to only select those who already reach the deadline (eliminate those have not reached the maturity) while 2 represents payoff, 1 represents default, what we do is to select only the status_time = 1 or 2

(mortgage <- mortgage %>% filter(status_time > 0))
## # A tibble: 41,747 x 28
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    48        -7         25      113       50       29087.     26.7
##  2     2    26        18         25      138       NA      105655.     65.5
##  3     3    29        -6         25      114       NA       44379.     31.5
##  4     5    27        18         25      138       NA       52101.     66.3
##  5     6    56        19         25      139       60      190474.     75.8
##  6     7    26        18         25      138       NA      107916.     77.9
##  7     8    25        18         25      138       NA      152393.     58.8
##  8     9    37        18         25      138       NA      130140.     99.1
##  9    10    29        18         25      139       34       88046.     67.3
## 10    11    27        18         25      138       NA      154664.     78.8
## # … with 41,737 more rows, and 20 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
(mortgage %>% ggplot(aes(x = balance_time)) + geom_histogram())
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

sqrt_result <- sqrt(var(mortgage$balance_time))
mean_result <-mean(mortgage$balance_time)
(threshold <- mean_result + 2 * sqrt_result)
## [1] 653053.3
(mortgage <- mortgage %>% filter(balance_time < 422225.7))
## # A tibble: 35,612 x 28
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    48        -7         25      113       50       29087.     26.7
##  2     2    26        18         25      138       NA      105655.     65.5
##  3     3    29        -6         25      114       NA       44379.     31.5
##  4     5    27        18         25      138       NA       52101.     66.3
##  5     6    56        19         25      139       60      190474.     75.8
##  6     7    26        18         25      138       NA      107916.     77.9
##  7     8    25        18         25      138       NA      152393.     58.8
##  8     9    37        18         25      138       NA      130140.     99.1
##  9    10    29        18         25      139       34       88046.     67.3
## 10    11    27        18         25      138       NA      154664.     78.8
## # … with 35,602 more rows, and 20 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
mortgage %>% ggplot(aes(x = balance_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(x = status_time, y = balance_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = LTV_time)) + geom_boxplot()
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).

mortgage %>% ggplot(aes(x = status_time, y = interest_rate_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = Interest_Rate_orig_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = gdp_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = uer_time)) + geom_boxplot()

# If you are working with a large data set, you might decide to start with a subset of the data. How did you choose this?

mortgage %>% mutate(count = n())%>% mutate(random = runif(count,1.0,10000.0)) %>% filter(random < 1000)
## # A tibble: 3,542 x 30
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     2    26        18         25      138       NA      105655.     65.5
##  2     9    37        18         25      138       NA      130140.     99.1
##  3    19    28        19         25      139       30      187753.     68.2
##  4    21    28        18         25      139       NA      197993.     67.9
##  5    29    33        18         25      139       38      177668.     80.0
##  6    32    25        19         25      139       NA      238082.     67.1
##  7    34    26        19         25      139       NA      257601.     67.4
##  8    62    36        25         33      140       NA       68553.    133. 
##  9    68    38        25         33      141       NA       82984.    113. 
## 10    73    41        25         33      140       NA       93837.    109. 
## # … with 3,532 more rows, and 22 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>, count <int>, random <dbl>
mortgage %>% ggplot(aes(id)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% count(time)
## # A tibble: 60 x 2
##     time     n
##    <dbl> <int>
##  1     1    22
##  2     2    37
##  3     3    57
##  4     4    56
##  5     5    65
##  6     6    95
##  7     7    80
##  8     8    72
##  9     9    97
## 10    10    74
## # … with 50 more rows
mortgage %>% count(orig_time)
## # A tibble: 72 x 2
##    orig_time     n
##        <dbl> <int>
##  1       -40    43
##  2       -39     3
##  3       -37     2
##  4       -36     1
##  5       -33     2
##  6       -30     3
##  7       -29     3
##  8       -28     2
##  9       -26     3
## 10       -25     2
## # … with 62 more rows
mortgage %>% count(first_time)
## # A tibble: 41 x 2
##    first_time     n
##         <dbl> <int>
##  1          1   276
##  2          2   267
##  3          3   387
##  4          4    16
##  5          5    24
##  6          6   102
##  7          9   135
##  8         10    71
##  9         11   175
## 10         12   313
## # … with 31 more rows
mortgage %>% count(mat_time)
## # A tibble: 175 x 2
##    mat_time     n
##       <dbl> <int>
##  1       23     1
##  2       26     1
##  3       27     1
##  4       28     1
##  5       29     4
##  6       30     1
##  7       31     3
##  8       32     4
##  9       33     2
## 10       34     3
## # … with 165 more rows
mortgage %>% count(res_time)
## # A tibble: 58 x 2
##    res_time     n
##       <dbl> <int>
##  1        4     3
##  2        5     2
##  3        6     5
##  4        7     7
##  5        8     4
##  6        9    11
##  7       10     9
##  8       11    11
##  9       12    16
## 10       13    14
## # … with 48 more rows
mortgage %>% ggplot(aes(balance_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% filter(balance_time < 0)
## # A tibble: 0 x 28
## # … with 28 variables: id <dbl>, time <dbl>, orig_time <dbl>, first_time <dbl>,
## #   mat_time <dbl>, res_time <dbl>, balance_time <dbl>, LTV_time <dbl>,
## #   interest_rate_time <dbl>, rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>,
## #   uer_time <dbl>, REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
mortgage %>% count(LTV_time)
## # A tibble: 32,926 x 2
##    LTV_time     n
##       <dbl> <int>
##  1   0        285
##  2   0.0430     1
##  3   0.0462     1
##  4   0.0554     1
##  5   0.0939     1
##  6   0.112      1
##  7   0.133      1
##  8   0.143      1
##  9   0.148      1
## 10   0.169      1
## # … with 32,916 more rows
mortgage %>% ggplot(aes(LTV_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 12 rows containing non-finite values (stat_bin).

mortgage %>% count(LTV_time, balance_time)
## # A tibble: 35,263 x 3
##    LTV_time balance_time     n
##       <dbl>        <dbl> <int>
##  1   0               0     285
##  2   0.0430        351.      1
##  3   0.0462         24.9     1
##  4   0.0554        455.      1
##  5   0.0939        262.      1
##  6   0.112         530.      1
##  7   0.133        1271.      1
##  8   0.143         673.      1
##  9   0.148         229       1
## 10   0.169        1508.      1
## # … with 35,253 more rows
mortgage %>% ggplot(aes(interest_rate_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(rate_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(hpi_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(hpi_orig_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(gdp_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(uer_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% count(REtype_CO_orig_time)
## # A tibble: 2 x 2
##   REtype_CO_orig_time     n
##                 <dbl> <int>
## 1                   0 33192
## 2                   1  2420
mortgage %>% count(REtype_PU_orig_time)
## # A tibble: 2 x 2
##   REtype_PU_orig_time     n
##                 <dbl> <int>
## 1                   0 31730
## 2                   1  3882
mortgage %>% count(REtype_SF_orig_time)
## # A tibble: 2 x 2
##   REtype_SF_orig_time     n
##                 <dbl> <int>
## 1                   0 13217
## 2                   1 22395
mortgage %>% count(investor_orig_time)
## # A tibble: 2 x 2
##   investor_orig_time     n
##                <dbl> <int>
## 1                  0 31141
## 2                  1  4471
mortgage %>% ggplot(aes(balance_orig_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% filter(balance_time < 0)
## # A tibble: 0 x 28
## # … with 28 variables: id <dbl>, time <dbl>, orig_time <dbl>, first_time <dbl>,
## #   mat_time <dbl>, res_time <dbl>, balance_time <dbl>, LTV_time <dbl>,
## #   interest_rate_time <dbl>, rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>,
## #   uer_time <dbl>, REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
mortgage %>% ggplot(aes(FICO_orig_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(LTV_orig_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(Interest_Rate_orig_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% filter(Interest_Rate_orig_time < 0)
## # A tibble: 0 x 28
## # … with 28 variables: id <dbl>, time <dbl>, orig_time <dbl>, first_time <dbl>,
## #   mat_time <dbl>, res_time <dbl>, balance_time <dbl>, LTV_time <dbl>,
## #   interest_rate_time <dbl>, rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>,
## #   uer_time <dbl>, REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
mortgage %>% count(default_time)
## # A tibble: 2 x 2
##   default_time     n
##          <dbl> <int>
## 1            0 22825
## 2            1 12787
mortgage %>% count(payoff_time)
## # A tibble: 2 x 2
##   payoff_time     n
##         <dbl> <int>
## 1           0 12787
## 2           1 22825
mortgage %>% count(status_time)
## # A tibble: 2 x 2
##   status_time     n
##   <chr>       <int>
## 1 1           12787
## 2 2           22825
mortgage %>% count(state_orig_time)
## # A tibble: 54 x 2
##    state_orig_time     n
##    <chr>           <int>
##  1 AK                 23
##  2 AL                223
##  3 AR                110
##  4 AZ               1533
##  5 CA               6554
##  6 CO                828
##  7 CT                381
##  8 DC                 87
##  9 DE                 88
## 10 FL               3981
## # … with 44 more rows
mortage1 <- mortgage %>% filter(!is.na(state_orig_time))

To clean the data, we use count function to check categorical value and geom_histogram to check numeral value. The Zero values in LTV and outstanding balance matches with each other(When the household pay off the debts). Since they are meaningful, we decide to leave them. Some of the variables like res_time has a lot of NA vlaues, but they are meaningful. We should keep them. However, there are 2829 NA values in state variable. Since they are meaningless, and the number is relatively small, we decided to filter them out.

One thing worths noticing is that the time of the dataset has been deidentified.It is difficult for us to combine ours with other datasets. Therefore, we decide to refill the exact time back. We do this by comparing the GDP growth data inside the dataset:

mortage1 %>% count(gdp_time,time) %>% ggplot(aes(time,gdp_time)) + geom_line()

In US history, it is rare to see negative GDP growth. Therefore we check from the website and found out that the only peroid is US history with around -4.11% GDP growth, which is in the first quarter in 2009. After that, we compare other time and GDP growth, found out that the whole dataset is separated into 60 quaters from 2000 Q1 to 2014 Q4. Therefore, we can add the exact timeback to our dataset:

date_seq<-seq(as.Date("2000/03/31"), by = "quarter", length.out = 60)
date_dataframe<-data.frame(date_seq)
date_dataframe<- date_dataframe%>%separate(date_seq,into=c('year','month','day'),sep='-')
date_dataframe$quater=1
date_dataframe$quater[date_dataframe$month=='07']<-2
date_dataframe$quater[date_dataframe$month=='10']<-3
date_dataframe$quater[date_dataframe$month=='12']<-4
date_dataframe$time<-seq(1,60,1)
mortgage2 <- mortage1 %>% left_join(date_dataframe, by = "time")
mortgage2
## # A tibble: 35,449 x 32
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    48        -7         25      113       50       29087.     26.7
##  2     2    26        18         25      138       NA      105655.     65.5
##  3     3    29        -6         25      114       NA       44379.     31.5
##  4     5    27        18         25      138       NA       52101.     66.3
##  5     6    56        19         25      139       60      190474.     75.8
##  6     7    26        18         25      138       NA      107916.     77.9
##  7     8    25        18         25      138       NA      152393.     58.8
##  8     9    37        18         25      138       NA      130140.     99.1
##  9    10    29        18         25      139       34       88046.     67.3
## 10    11    27        18         25      138       NA      154664.     78.8
## # … with 35,439 more rows, and 24 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>, year <chr>, month <chr>, day <chr>, quater <dbl>

After loading and cleaning, mortgage2 is the dataset we are working on.

(mortgage3 <- mortgage2 %>% filter(status_time > 0))
## # A tibble: 35,449 x 32
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    48        -7         25      113       50       29087.     26.7
##  2     2    26        18         25      138       NA      105655.     65.5
##  3     3    29        -6         25      114       NA       44379.     31.5
##  4     5    27        18         25      138       NA       52101.     66.3
##  5     6    56        19         25      139       60      190474.     75.8
##  6     7    26        18         25      138       NA      107916.     77.9
##  7     8    25        18         25      138       NA      152393.     58.8
##  8     9    37        18         25      138       NA      130140.     99.1
##  9    10    29        18         25      139       34       88046.     67.3
## 10    11    27        18         25      138       NA      154664.     78.8
## # … with 35,439 more rows, and 24 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>, year <chr>, month <chr>, day <chr>, quater <dbl>
mortgage2 %>% ggplot(aes(x = balance_time)) + geom_histogram(binwidth = 10000)

These are patterns we found:

sqrt_result <- sqrt(var(mortgage$balance_time))
mean_result <-mean(mortgage$balance_time)
(threshold <- mean_result + 2 * sqrt_result)
## [1] 387104.2
(mortgage <- mortgage %>% filter(balance_time < 422225.7))
## # A tibble: 35,612 x 28
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1     1    48        -7         25      113       50       29087.     26.7
##  2     2    26        18         25      138       NA      105655.     65.5
##  3     3    29        -6         25      114       NA       44379.     31.5
##  4     5    27        18         25      138       NA       52101.     66.3
##  5     6    56        19         25      139       60      190474.     75.8
##  6     7    26        18         25      138       NA      107916.     77.9
##  7     8    25        18         25      138       NA      152393.     58.8
##  8     9    37        18         25      138       NA      130140.     99.1
##  9    10    29        18         25      139       34       88046.     67.3
## 10    11    27        18         25      138       NA      154664.     78.8
## # … with 35,602 more rows, and 20 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>
mortgage %>% ggplot(aes(x = balance_time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mortgage %>% ggplot(aes(x = status_time, y = balance_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = LTV_time)) + geom_boxplot()
## Warning: Removed 12 rows containing non-finite values (stat_boxplot).

mortgage %>% ggplot(aes(x = status_time, y = interest_rate_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = Interest_Rate_orig_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = gdp_time)) + geom_boxplot()

mortgage %>% ggplot(aes(x = status_time, y = uer_time)) + geom_boxplot()

Our Idea is to make a logistics regression to see any variable strongly influence people’s choice between payoff and default, so the first step for us to clean the data is to only select those who already reach the deadline (eliminate those have not reached the maturity) while 2 represents payoff, 1 represents default, what we do is to select only the status_time = 1 or 2

No big difference between payoff and default groups of status in outstanding balance, loan-to-value and interest rate at observation time. For interest rate at original time, the upper quartile and mean between payoff and default groups have no large difference, but the lower quartile of payoff group is apparently lower than the default group. The GDP at observation time for payoff group is apparently higher than default group. For the unemployment rate at observation time, both group have similar lower quartile, but default group has apparently higher mean and upper quartile than the payoff group.

Since this is a large dataset, we can start by a subset.We can use a random number generator because I do not want my subset to be biased and stratified.

mortgage %>% mutate(count = n())%>% mutate(random = runif(count,1.0,10000.0)) %>% filter(random < 1000)
## # A tibble: 3,562 x 30
##       id  time orig_time first_time mat_time res_time balance_time LTV_time
##    <dbl> <dbl>     <dbl>      <dbl>    <dbl>    <dbl>        <dbl>    <dbl>
##  1    23    25        18         25      139       NA       79980.     52.8
##  2    24    32        19         25      139       38       35166.     64.6
##  3    38    27        19         25      139       NA      218782.     67.7
##  4    50    27        19         25      139       NA      121370.     80.4
##  5    59    30       -15         25      105       NA       57423.     22.0
##  6    60    28       -13         25      107       NA      113538.     21.6
##  7    64    33        25         33      140       NA           0       0  
##  8    81    26        21         25      141       NA      234598.     73.8
##  9    95    31        22         25      142       43      141288.     84.8
## 10    98    37        25         33      141       41      119337.    117. 
## # … with 3,552 more rows, and 22 more variables: interest_rate_time <dbl>,
## #   rate_time <dbl>, hpi_time <dbl>, gdp_time <dbl>, uer_time <dbl>,
## #   REtype_CO_orig_time <dbl>, REtype_PU_orig_time <dbl>,
## #   REtype_SF_orig_time <dbl>, investor_orig_time <dbl>,
## #   balance_orig_time <dbl>, FICO_orig_time <dbl>, LTV_orig_time <dbl>,
## #   Interest_Rate_orig_time <dbl>, state_orig_time <chr>, hpi_orig_time <dbl>,
## #   default_time <dbl>, payoff_time <dbl>, status_time <chr>, lgd_time <dbl>,
## #   recovery_res <dbl>, count <int>, random <dbl>