Based on Prosper website: Prosper is America’s first marketplace lending platform, with over $7 billion in funded loans. Prosper allows people to invest in each other in a way that is financially and socially rewarding. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors.
Data Set contains information about Prosper loans:
> loan %>% dim
[1] 113937 82
> loan %>% str
'data.frame': 113937 obs. of 82 variables:
$ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
$ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
$ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
$ CreditGrade : Ord.factor w/ 9 levels ""<"NC"<"HR"<"E"<..: 6 1 3 1 1 1 1 1 1 1 ...
$ Term : int 36 36 36 36 36 60 36 36 36 36 ...
$ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
$ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
$ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
$ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
$ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
$ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
$ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
$ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
$ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
$ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
$ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
$ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
$ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
$ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
$ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
$ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
$ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
$ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
$ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
$ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
$ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
$ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
$ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
$ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
$ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
$ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
$ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
$ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
$ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
$ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
$ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
$ AmountDelinquent : num 472 0 NA 10056 0 ...
$ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
$ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
$ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
$ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
$ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
$ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
$ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
$ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
$ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
$ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
$ IncomeRange : Ord.factor w/ 8 levels "Not displayed"<..: 5 6 1 5 8 8 5 5 5 5 ...
$ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
$ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
$ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
$ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
$ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
$ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
$ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
$ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
$ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
$ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
$ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
$ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
$ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
$ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
$ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
$ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
$ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
$ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
$ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
$ MonthlyLoanPayment : num 330 319 123 321 564 ...
$ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
$ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
$ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
$ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
$ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
$ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
$ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
$ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
$ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
$ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
$ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
$ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
$ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
$ ListingCategory.factor : Ord.factor w/ 21 levels "Not Available"<..: 1 3 1 17 3 2 2 3 8 8 ...
[Figure 1]
> loan $LoanOriginalAmount %>% summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
1000 4000 6500 8337 12000 35000
[Figure 2]
The length of the loan expressed in months.
[Figure 3]
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
[Figure 4]
[Figure 5]
> loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>% summary
LenderYield EstimatedEffectiveYield BorrowerAPR BorrowerRate
Min. :-0.0100 Min. :-0.183 Min. :0.00653 Min. :0.0000
1st Qu.: 0.1242 1st Qu.: 0.116 1st Qu.:0.15629 1st Qu.:0.1340
Median : 0.1730 Median : 0.162 Median :0.20976 Median :0.1840
Mean : 0.1827 Mean : 0.169 Mean :0.21883 Mean :0.1928
3rd Qu.: 0.2400 3rd Qu.: 0.224 3rd Qu.:0.28381 3rd Qu.:0.2500
Max. : 0.4925 Max. : 0.320 Max. :0.51229 Max. :0.4975
NA's :29084 NA's :25
The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
[Figure 6]
To generate a more meaningful chart, I have to reorganize the levels of CreditGrade in terms of credit risk rating.
[Figure 7]
The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
> loan$ProsperRating..Alpha. %>% table
.
A AA B C D E HR
29084 14551 5372 15581 18345 14274 9795 6935
[Figure 8]
[Figure 9]
[Figure 10]
[Figure 11]
[Figure 12]
[Figure 13]
Notice that:
The dataset contains 113937 observations. Each observation is described by 82 variables.
I choose with 17 features out of 81, and classify them into 4 groups:
My main interest in this dataset is LenderYield which is the lender yield on the loan because I want to analyze this dataset at a glance of an investor. So there are some questions that I try to answer:
Original Loan:
Chargedoff,
Defaultedand
Completedbecause it tells you how past loans performed. [Figure 4]
Interest Rate: We cannot find the specific interest rate in the dataset. However, I believe Lender Yield, Estimated Effective Yield, Borrower APR and Borrower Rate can represent as the interest rate. Even though they are used for different purpose, their distribution are nigh identical (Based on [Figure 5]) => My question is: Are they correlated among each other?
Loan Assessment/ Loan Rating:
a) By Credit & Rating:
b) By Income & Asset:
I did not create any new variable. However, I have to reorder levels of CreditGrade and ProsperRating..Alpha. in term of risk rating.
One of interesting observation is the multi-modal distribtuion of LoanOriginal Amount. It seems that Loans are made in the multiple of $5000.
Since I have used alot of bar plots in univariate plot session, I have created a function as called as drawbarplot. In this function, it has already clean the data such as removing NA and tidy them by generating factors for categorical variables.
[Figure 14]
[Figure 15]
[Figure 16]
> loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>%
subset(!is.na(LenderYield) & !is.na(EstimatedEffectiveYield) &
!is.na(BorrowerAPR) & !is.na(BorrowerRate)) %>% cor
LenderYield EstimatedEffectiveYield BorrowerAPR BorrowerRate
LenderYield 1.0000000 0.8953425 0.9933345 0.9999958
EstimatedEffectiveYield 0.8953425 1.0000000 0.8956348 0.8952825
BorrowerAPR 0.9933345 0.8956348 1.0000000 0.9933333
BorrowerRate 0.9999958 0.8952825 0.9933333 1.0000000
[Figure 17]
[Figure 18]
Loan Status: I focus strongly on LoanStatus variable, especially Chargedoff
, Defaulted
and Completed
. It indicates the health status of all the past loans.
ChargedOffloan is higher when Prosper Rating is lower whereas The proportion of
Completedloan is lower when Prosper Rating is lower => It means the risk of being charged off is higher with lower rating loan. [Figure 14] However, borrowers who have higher risk take larger loans. [Figure 15]
Chargedoffloan is higher when the income is lower. However the difference in proportion is not large. [Figure 14]
Chargedoffloan is higher when the term of loan is higher. [Figure 14] However, bigger loans are made under longer loan terms. [Figure 15]
Interest Rate:
Prosper Rating: There is a linear trend between Prosper Rating and
I thought Homeowner is one of the factor contributing to the difference in loan status. However, it shows that the distribution of loan status between homeowner and non-homeowner are almost similar. It means the proportion of charged loans are in-different whether the borrowers own a house or not. [Figure 14]
The strongest relationship is between LenderYield and BorrowerRate. The correlation coefficient is nearly 1.
Based on the bivariate analysis, I want to check the linear relationship between ProsperRatingNumeric and predictors such as CreditScoreRangeLower, CreditScoreRangeUpper, TradesNeverDelinquent..percentage., InquiriesLast6Months. Also I add 2 more predictors: PublicRecordsLast12Months, DebtToIncomeRatio to this model.
Call:
lm(formula = ProsperRating..numeric. ~ CreditScoreRangeLower +
CreditScoreRangeUpper + TradesNeverDelinquent..percentage. +
InquiriesLast6Months + PublicRecordsLast12Months + DebtToIncomeRatio -
1, data = .)
Residuals:
Min 1Q Median 3Q Max
-5.3206 -0.8063 0.1158 0.8861 9.3861
Coefficients:
Estimate Std. Error t value Pr(>|t|)
CreditScoreRangeLower 0.499747 0.003779 132.239 < 2e-16 ***
CreditScoreRangeUpper -0.481626 0.003688 -130.602 < 2e-16 ***
TradesNeverDelinquent..percentage. 1.202335 0.040808 29.463 < 2e-16 ***
InquiriesLast6Months -0.278500 0.003306 -84.235 < 2e-16 ***
PublicRecordsLast12Months -0.154405 0.044775 -3.448 0.000564 ***
DebtToIncomeRatio -0.766687 0.014584 -52.572 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.283 on 77551 degrees of freedom
(36380 observations deleted due to missingness)
Multiple R-squared: 0.9169, Adjusted R-squared: 0.9169
F-statistic: 1.427e+05 on 6 and 77551 DF, p-value: < 2.2e-16
[Figure 19]
[Figure 20]
Call:
lm(formula = LenderYield ~ factor(ProsperRating..numeric.) +
factor(Term) + factor(LoanStatus), data = .)
Residuals:
Min 1Q Median 3Q Max
-0.156886 -0.010819 0.000239 0.012114 0.189556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2714997 0.0007577 358.316 <2e-16 ***
factor(ProsperRating..numeric.)2 -0.0126750 0.0005205 -24.352 <2e-16 ***
factor(ProsperRating..numeric.)3 -0.0643999 0.0004772 -134.955 <2e-16 ***
factor(ProsperRating..numeric.)4 -0.1175471 0.0005344 -219.966 <2e-16 ***
factor(ProsperRating..numeric.)5 -0.1617421 0.0005506 -293.737 <2e-16 ***
factor(ProsperRating..numeric.)6 -0.2097962 0.0005376 -390.248 <2e-16 ***
factor(ProsperRating..numeric.)7 -0.2356068 0.0006597 -357.161 <2e-16 ***
factor(Term)36 0.0406862 0.0006017 67.615 <2e-16 ***
factor(Term)60 0.0635214 0.0006921 91.775 <2e-16 ***
factor(LoanStatus)Completed -0.0047244 0.0003593 -13.149 <2e-16 ***
factor(LoanStatus)Defaulted 0.0008353 0.0007717 1.082 0.279
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02244 on 25994 degrees of freedom
(29079 observations deleted due to missingness)
Multiple R-squared: 0.929, Adjusted R-squared: 0.9289
F-statistic: 3.4e+04 on 10 and 25994 DF, p-value: < 2.2e-16
###############################################################################
Analysis of Variance Table
Response: LenderYield
Df Sum Sq Mean Sq F value Pr(>F)
factor(ProsperRating..numeric.) 6 166.464 27.7440 55111.56 < 2.2e-16 ***
factor(Term) 2 4.578 2.2889 4546.72 < 2.2e-16 ***
factor(LoanStatus) 2 0.105 0.0525 104.32 < 2.2e-16 ***
Residuals 25994 13.086 0.0005
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Based on the coefficient table and QQ-Plot [Figure 19], it shows that there is a strong linear relationship and goodness of fit of the model between Prosper Rating and following predictors:
For each loan term, there is linear relationship between Prosper Rating and LoanStatus.
Since we have the relationship betwen Prosper Rating and other predictors, we can save time upon analyzing loans by looking at Prosper Rating. Prosper Rating is a good measure of risk.
At the begining, we have too many variables. Now, we can narrow down to the most three important variables: LoanStatus, ProsperRating(Numeric) and LenderYield.
[Figure 21]
The above plot shows that higher risk loans yield more. As an investor, I think the yield for Prosper Rating 7 (AA-lowest risk) is too low. If we invest in these loan, the average yielding is about 6% to 7% annually. So we should focus on loans with rating 4, 5(which is C and B) since the risk of being charged off is less than 25% but the yielding is about 15% to 20%. It means instead of investing in a large amount of money into 1 loan, we limit our risk by divesify into different small loans with rating B and C, the expected return rate could be about 16% to 18%.
[Figure 22]
I have created ListingYear variable to see whether the risk of bad loans are gradually exposure when the lending platform is more mature. And this is proven by the above plot. We can see that in 2009 and 2010, the number of blue dots are still noticable, but the risk of blue dots are dissolved by time when the number of loans increases.
[Figure 23]
There are some risks that cannot be accounted for in this dataset are the health of our economy and business management of Prosper:
Therefore the plot above can be used as an indicator to estimate how Prosper operates. If the proportion of Prosper loans of new quarter drops compared to the same quarter of last year, we can think of stopping our investment into new loans.
The dataset of Prosper loan had nearly 114000 loans from Nov 2005 to March 2014. Over the year the number of loans has been increasing. Based on my analysis, I can conclude that Prosper Rating is reliable. It is good indicator of risks and bad loans. Also, we find out that higher risk loans tend to yield more. By March 2014, the lending platform has matured enough to exposure the risk of bad loans. In other words, if we divesify our investments, we can expect higher return with higher risk loans on average. However, there are some unaccountable risks such as economy recession or Prosper’s pooly management. Morever, it is hard to find short term loans as 12-months with higher yeilding rate. Therefore we tend to invest to 36-months or 60-months loans.