Based on Prosper website: Prosper is America’s first marketplace lending platform, with over $7 billion in funded loans. Prosper allows people to invest in each other in a way that is financially and socially rewarding. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors.
Data Set contains information about Prosper loans:
loan %>% dim
## [1] 113937 81
loan %>% str
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
loan $LoanOriginalAmount %>% summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
## .
## Q1 2006 Q1 2007 Q1 2008 Q1 2010 Q1 2011 Q1 2012 Q1 2013 Q1 2014 Q2 2006
## 315 3079 3074 1243 1744 4435 3616 12172 1254
## Q2 2007 Q2 2008 Q2 2009 Q2 2010 Q2 2011 Q2 2012 Q2 2013 Q3 2006 Q3 2007
## 3118 4344 13 1539 2478 5061 7099 1934 2671
## Q3 2008 Q3 2009 Q3 2010 Q3 2011 Q3 2012 Q3 2013 Q4 2005 Q4 2006 Q4 2007
## 3602 585 1270 3093 5632 9180 22 2403 2592
## Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Q4 2013
## 532 1449 1600 3913 4425 14450
The length of the loan expressed in months.
## .
## 12 36 60
## 1614 87778 24545
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
## .
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>% summary
## LenderYield EstimatedEffectiveYield BorrowerAPR
## Min. :-0.0100 Min. :-0.183 Min. :0.00653
## 1st Qu.: 0.1242 1st Qu.: 0.116 1st Qu.:0.15629
## Median : 0.1730 Median : 0.162 Median :0.20976
## Mean : 0.1827 Mean : 0.169 Mean :0.21883
## 3rd Qu.: 0.2400 3rd Qu.: 0.224 3rd Qu.:0.28381
## Max. : 0.4925 Max. : 0.320 Max. :0.51229
## NA's :29084 NA's :25
## BorrowerRate
## Min. :0.0000
## 1st Qu.:0.1340
## Median :0.1840
## Mean :0.1928
## 3rd Qu.:0.2500
## Max. :0.4975
##
The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
## .
## A AA B C D E HR NC
## 84984 3315 3509 4389 5649 5153 3289 3508 141
Since Credit Grade Rating were used before 2009, Credit Grade are not applied to 74.6% of current loan. However, it is still useful to tell us the picture of the past.
To generate a more meaningful chart, I have to reorganize the levels of CreditGrade in terms of credit risk rating.
The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
loan$ProsperRating..Alpha. %>% table
## .
## A AA B C D E HR
## 29084 14551 5372 15581 18345 14274 9795 6935
The dataset contains 113937 observations. Each observation is described by 82 variables.
I choose with 17 features out of 81, and classify them into 4 groups:
My main interest in this dataset is LenderYield which is the lender yield on the loan because I want to analyze this dataset at a glance of an investor. So there are some questions that I try to answer:
Interest Rate: We cannot find the specific interest rate in the dataset. However, I believe Lender Yield, Estimated Effective Yield, Borrower APR and Borrower Rate can represent as the interest rate. Even though they are used for different purpose, their distribution are nigh identical (Based on Figure5 ) => My question is: Are they correlated among each other?
Loan Assessment/ Loan Rating:
I did not create any new variable. However, I have to reorder levels of CreditGrade and ProsperRating..Alpha. in term of risk rating.
One of interesting observation is the multi-modal distribtuion of LoanOriginal Amount. It seems that Loans are made in the multiple of $5000.
Since I have used alot of bar plots in univariate plot session, I have created a function as called as draw_bar_plot. In this function, it has already clean the data such as removing NA and tidy them by generating factors for categorical variables.
loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>%
subset(!is.na(LenderYield) & !is.na(EstimatedEffectiveYield) &
!is.na(BorrowerAPR) & !is.na(BorrowerRate)) %>% cor
## LenderYield EstimatedEffectiveYield BorrowerAPR
## LenderYield 1.0000000 0.8953425 0.9933345
## EstimatedEffectiveYield 0.8953425 1.0000000 0.8956348
## BorrowerAPR 0.9933345 0.8956348 1.0000000
## BorrowerRate 0.9999958 0.8952825 0.9933333
## BorrowerRate
## LenderYield 0.9999958
## EstimatedEffectiveYield 0.8952825
## BorrowerAPR 0.9933333
## BorrowerRate 1.0000000
Loan Status: I focus strongly on LoanStatus variable, especially “Chargedoff”, “Defaulted” and “Completed”. It indicates the health status of all the past loans.
I thought Homeowner is one of the factor contributing to the difference in loan status. However, it shows that the distribution of loan status between homeowner and non-homeowner are almost similar. It means the proportion of charged loans are in-different whether the borrowers own a house or not. Figure13
The strongest relationship is between LenderYield and BorrowerRate. The correlation coefficient is nearly 1.
Based on the bivariate analysis, I want to check the linear relationship between ProsperRatingNumeric and predictors such as CreditScoreRangeLower, CreditScoreRangeUpper, TradesNeverDelinquent..percentage., InquiriesLast6Months. Also I add 2 more predictors: PublicRecordsLast12Months, DebtToIncomeRatio to this model.
ProsperRating.lm %>% summary
##
## Call:
## lm(formula = ProsperRating..numeric. ~ CreditScoreRangeLower +
## CreditScoreRangeUpper + TradesNeverDelinquent..percentage. +
## InquiriesLast6Months + PublicRecordsLast12Months + DebtToIncomeRatio -
## 1, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3206 -0.8063 0.1158 0.8861 9.3861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## CreditScoreRangeLower 0.499747 0.003779 132.239 < 2e-16
## CreditScoreRangeUpper -0.481626 0.003688 -130.602 < 2e-16
## TradesNeverDelinquent..percentage. 1.202335 0.040808 29.463 < 2e-16
## InquiriesLast6Months -0.278500 0.003306 -84.235 < 2e-16
## PublicRecordsLast12Months -0.154405 0.044775 -3.448 0.000564
## DebtToIncomeRatio -0.766687 0.014584 -52.572 < 2e-16
##
## CreditScoreRangeLower ***
## CreditScoreRangeUpper ***
## TradesNeverDelinquent..percentage. ***
## InquiriesLast6Months ***
## PublicRecordsLast12Months ***
## DebtToIncomeRatio ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.283 on 77551 degrees of freedom
## (36380 observations deleted due to missingness)
## Multiple R-squared: 0.9169, Adjusted R-squared: 0.9169
## F-statistic: 1.427e+05 on 6 and 77551 DF, p-value: < 2.2e-16
LenderYield.lm <-
loan %>%
select(LenderYield, LoanStatus, ProsperRating..numeric., Term) %>%
filter(LoanStatus %in% c("Defaulted", "Completed", "Chargedoff")) %>%
lm(LenderYield ~ factor(ProsperRating..numeric.) + factor(Term) + factor(LoanStatus), data = .)
LenderYield.lm %>% summary
##
## Call:
## lm(formula = LenderYield ~ factor(ProsperRating..numeric.) +
## factor(Term) + factor(LoanStatus), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.156886 -0.010819 0.000239 0.012114 0.189556
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2714997 0.0007577 358.316 <2e-16
## factor(ProsperRating..numeric.)2 -0.0126750 0.0005205 -24.352 <2e-16
## factor(ProsperRating..numeric.)3 -0.0643999 0.0004772 -134.955 <2e-16
## factor(ProsperRating..numeric.)4 -0.1175471 0.0005344 -219.966 <2e-16
## factor(ProsperRating..numeric.)5 -0.1617421 0.0005506 -293.737 <2e-16
## factor(ProsperRating..numeric.)6 -0.2097962 0.0005376 -390.248 <2e-16
## factor(ProsperRating..numeric.)7 -0.2356068 0.0006597 -357.161 <2e-16
## factor(Term)36 0.0406862 0.0006017 67.615 <2e-16
## factor(Term)60 0.0635214 0.0006921 91.775 <2e-16
## factor(LoanStatus)Completed -0.0047244 0.0003593 -13.149 <2e-16
## factor(LoanStatus)Defaulted 0.0008353 0.0007717 1.082 0.279
##
## (Intercept) ***
## factor(ProsperRating..numeric.)2 ***
## factor(ProsperRating..numeric.)3 ***
## factor(ProsperRating..numeric.)4 ***
## factor(ProsperRating..numeric.)5 ***
## factor(ProsperRating..numeric.)6 ***
## factor(ProsperRating..numeric.)7 ***
## factor(Term)36 ***
## factor(Term)60 ***
## factor(LoanStatus)Completed ***
## factor(LoanStatus)Defaulted
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02244 on 25994 degrees of freedom
## (29079 observations deleted due to missingness)
## Multiple R-squared: 0.929, Adjusted R-squared: 0.9289
## F-statistic: 3.4e+04 on 10 and 25994 DF, p-value: < 2.2e-16
LenderYield.lm %>% anova
## Analysis of Variance Table
##
## Response: LenderYield
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(ProsperRating..numeric.) 6 166.464 27.7440 55111.56 < 2.2e-16
## factor(Term) 2 4.578 2.2889 4546.72 < 2.2e-16
## factor(LoanStatus) 2 0.105 0.0525 104.32 < 2.2e-16
## Residuals 25994 13.086 0.0005
##
## factor(ProsperRating..numeric.) ***
## factor(Term) ***
## factor(LoanStatus) ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since we have the relationship betwen Prosper Rating and other predictors, we can save time upon analyzing loans by looking at Prosper Rating. Prosper Rating is a good measure of risk.
The above plot shows that higher risk loans yield more. As an investor, I think the yield for Prosper Rating 7 (AA-lowest risk) is too low. If we invest in these loan, the average yielding is about 6% to 7% annually. So we should focus on loans with rating 4, 5(which is C and B) since the risk of being charged off is less than 25% but the yielding is about 15% to 20%. It means instead of investing in a large amount of money into 1 loan, we limit our risk by divesify into different small loans with rating B and C, the expected return rate could be about 16% to 18%.
## .
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 23 6213 11557 11263 2206 5530 11442 19556 35413 10734
I have created ListingYear variable to see whether the risk of bad loans are gradually exposure when the lending platform is more mature. And this is proven by the above plot. We can see that in 2009 and 2010, the number of blue dots are still noticable, but the risk of blue dots are dissolved by time when the number of loans increases.
## .
## Q1 2006 Q1 2007 Q1 2008 Q1 2010 Q1 2011 Q1 2012 Q1 2013 Q1 2014 Q2 2006
## 315 3079 3074 1243 1744 4435 3616 12172 1254
## Q2 2007 Q2 2008 Q2 2009 Q2 2010 Q2 2011 Q2 2012 Q2 2013 Q3 2006 Q3 2007
## 3118 4344 13 1539 2478 5061 7099 1934 2671
## Q3 2008 Q3 2009 Q3 2010 Q3 2011 Q3 2012 Q3 2013 Q4 2005 Q4 2006 Q4 2007
## 3602 585 1270 3093 5632 9180 22 2403 2592
## Q4 2008 Q4 2009 Q4 2010 Q4 2011 Q4 2012 Q4 2013
## 532 1449 1600 3913 4425 14450
There are some risks that cannot be accounted for in this dataset are the health of our economy and business management of Prosper:
Therefore the plot above can be used as an indicator to estimate how Prosper operates. If the proportion of Prosper loans of new quarter drops compared to the same quarter of last year, we can think of stopping our investment into new loans.
The dataset of Prosper loan had nearly 114000 loans from Nov 2005 to March 2014. Over the year the number of loans has been increasing. Based on my analysis, I can conclude that Prosper Rating is reliable. It is good indicator of risks and bad loans. Also, we find out that higher risk loans tend to yield more. By March 2014, the lending platform has matured enough to exposure the risk of bad loans. In other words, if we divesify our investments, we can expect higher return with higher risk loans on average. However, there are some unaccountable risks such as economy recession or Prosper’s pooly management. Morever, it is hard to find short term loans as 12-months with higher yeilding rate. Therefore we tend to invest to 36-months or 60-months loans.