Exploring Loan Data from Prosper

By Kevin Vo

I. Dataset:

Based on Prosper website: Prosper is America’s first marketplace lending platform, with over $7 billion in funded loans. Prosper allows people to invest in each other in a way that is financially and socially rewarding. On Prosper, borrowers list loan requests between $2,000 and $35,000 and individual investors invest as little as $25 in each loan listing they select. Prosper handles the servicing of the loan on behalf of the matched borrowers and investors.

Data Set contains information about Prosper loans:

> loan %>% dim
[1] 113937     82
> loan %>% str
'data.frame':   113937 obs. of  82 variables:
 $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
 $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
 $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
 $ CreditGrade                        : Ord.factor w/ 9 levels ""<"NC"<"HR"<"E"<..: 6 1 3 1 1 1 1 1 1 1 ...
 $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
 $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
 $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
 $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
 $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
 $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
 $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
 $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
 $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
 $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
 $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
 $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
 $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
 $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
 $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
 $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
 $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
 $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
 $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
 $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
 $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
 $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
 $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
 $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
 $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
 $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
 $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
 $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
 $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
 $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
 $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
 $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
 $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
 $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
 $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
 $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
 $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
 $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
 $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
 $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
 $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
 $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
 $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
 $ IncomeRange                        : Ord.factor w/ 8 levels "Not displayed"<..: 5 6 1 5 8 8 5 5 5 5 ...
 $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
 $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
 $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
 $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
 $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
 $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
 $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
 $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
 $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
 $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
 $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
 $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
 $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
 $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
 $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
 $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
 $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
 $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
 $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
 $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
 $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
 $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
 $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...
 $ ListingCategory.factor             : Ord.factor w/ 21 levels "Not Available"<..: 1 3 1 17 3 2 2 3 8 8 ...

II-A. Univariate Plot Session:

1. Original Loan:

image [Figure 1]

> loan $LoanOriginalAmount %>% summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    4000    6500    8337   12000   35000

image [Figure 2]

The length of the loan expressed in months.

image [Figure 3]

The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

image [Figure 4]

2. Interest Rate:

image [Figure 5]

> loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>% summary

  LenderYield      EstimatedEffectiveYield  BorrowerAPR       BorrowerRate
 Min.   :-0.0100   Min.   :-0.183          Min.   :0.00653   Min.   :0.0000
 1st Qu.: 0.1242   1st Qu.: 0.116          1st Qu.:0.15629   1st Qu.:0.1340
 Median : 0.1730   Median : 0.162          Median :0.20976   Median :0.1840
 Mean   : 0.1827   Mean   : 0.169          Mean   :0.21883   Mean   :0.1928
 3rd Qu.: 0.2400   3rd Qu.: 0.224          3rd Qu.:0.28381   3rd Qu.:0.2500
 Max.   : 0.4925   Max.   : 0.320          Max.   :0.51229   Max.   :0.4975
                   NA's   :29084           NA's   :25

3. Loan Assessment By Credit & Rating:

The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.

image [Figure 6]

To generate a more meaningful chart, I have to reorganize the levels of CreditGrade in terms of credit risk rating.

image [Figure 7]

The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.

> loan$ProsperRating..Alpha. %>% table
.
          A    AA     B     C     D     E    HR
29084 14551  5372 15581 18345 14274  9795  6935

image [Figure 8]

4. Loan Assessment By Income & Asset:

image [Figure 9]

image [Figure 10]

5. Other Information related to Loan:

image [Figure 11]

image [Figure 12]

image [Figure 13]

Notice that:

II-B. Univariate Analysis

What is the structure of your dataset?

The dataset contains 113937 observations. Each observation is described by 82 variables.

I choose with 17 features out of 81, and classify them into 4 groups:

What is/are the main feature(s) of interest in your dataset?

My main interest in this dataset is LenderYield which is the lender yield on the loan because I want to analyze this dataset at a glance of an investor. So there are some questions that I try to answer:

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

I did not create any new variable. However, I have to reorder levels of CreditGrade and ProsperRating..Alpha. in term of risk rating.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

III-A. Bivariate Plot Session:

1. Loan Status:

image [Figure 14]

image [Figure 15]

2. Interest Rate

image [Figure 16]

> loan[,c("LenderYield", "EstimatedEffectiveYield", "BorrowerAPR","BorrowerRate")] %>%
subset(!is.na(LenderYield) & !is.na(EstimatedEffectiveYield) &
       !is.na(BorrowerAPR) & !is.na(BorrowerRate)) %>% cor
       
                        LenderYield EstimatedEffectiveYield BorrowerAPR BorrowerRate
LenderYield               1.0000000               0.8953425   0.9933345    0.9999958
EstimatedEffectiveYield   0.8953425               1.0000000   0.8956348    0.8952825
BorrowerAPR               0.9933345               0.8956348   1.0000000    0.9933333
BorrowerRate              0.9999958               0.8952825   0.9933333    1.0000000

image [Figure 17]

3. Prosper Rating:

image [Figure 18]

III-B. Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I thought Homeowner is one of the factor contributing to the difference in loan status. However, it shows that the distribution of loan status between homeowner and non-homeowner are almost similar. It means the proportion of charged loans are in-different whether the borrowers own a house or not. [Figure 14]

What was the strongest relationship you found?

The strongest relationship is between LenderYield and BorrowerRate. The correlation coefficient is nearly 1.

IV-A. Multivariate Plots Section

Based on the bivariate analysis, I want to check the linear relationship between ProsperRatingNumeric and predictors such as CreditScoreRangeLower, CreditScoreRangeUpper, TradesNeverDelinquent..percentage., InquiriesLast6Months. Also I add 2 more predictors: PublicRecordsLast12Months, DebtToIncomeRatio to this model.

Call:
lm(formula = ProsperRating..numeric. ~ CreditScoreRangeLower +
    CreditScoreRangeUpper + TradesNeverDelinquent..percentage. +
    InquiriesLast6Months + PublicRecordsLast12Months + DebtToIncomeRatio -
    1, data = .)

Residuals:
    Min      1Q  Median      3Q     Max
-5.3206 -0.8063  0.1158  0.8861  9.3861

Coefficients:
                                    Estimate Std. Error  t value Pr(>|t|)
CreditScoreRangeLower               0.499747   0.003779  132.239  < 2e-16 ***
CreditScoreRangeUpper              -0.481626   0.003688 -130.602  < 2e-16 ***
TradesNeverDelinquent..percentage.  1.202335   0.040808   29.463  < 2e-16 ***
InquiriesLast6Months               -0.278500   0.003306  -84.235  < 2e-16 ***
PublicRecordsLast12Months          -0.154405   0.044775   -3.448 0.000564 ***
DebtToIncomeRatio                  -0.766687   0.014584  -52.572  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.283 on 77551 degrees of freedom
  (36380 observations deleted due to missingness)
Multiple R-squared:  0.9169,    Adjusted R-squared:  0.9169
F-statistic: 1.427e+05 on 6 and 77551 DF,  p-value: < 2.2e-16

image [Figure 19]

image [Figure 20]

Call:
lm(formula = LenderYield ~ factor(ProsperRating..numeric.) +
    factor(Term) + factor(LoanStatus), data = .)

Residuals:
      Min        1Q    Median        3Q       Max
-0.156886 -0.010819  0.000239  0.012114  0.189556

Coefficients:
                                   Estimate Std. Error  t value Pr(>|t|)
(Intercept)                       0.2714997  0.0007577  358.316   <2e-16 ***
factor(ProsperRating..numeric.)2 -0.0126750  0.0005205  -24.352   <2e-16 ***
factor(ProsperRating..numeric.)3 -0.0643999  0.0004772 -134.955   <2e-16 ***
factor(ProsperRating..numeric.)4 -0.1175471  0.0005344 -219.966   <2e-16 ***
factor(ProsperRating..numeric.)5 -0.1617421  0.0005506 -293.737   <2e-16 ***
factor(ProsperRating..numeric.)6 -0.2097962  0.0005376 -390.248   <2e-16 ***
factor(ProsperRating..numeric.)7 -0.2356068  0.0006597 -357.161   <2e-16 ***
factor(Term)36                    0.0406862  0.0006017   67.615   <2e-16 ***
factor(Term)60                    0.0635214  0.0006921   91.775   <2e-16 ***
factor(LoanStatus)Completed      -0.0047244  0.0003593  -13.149   <2e-16 ***
factor(LoanStatus)Defaulted       0.0008353  0.0007717    1.082    0.279
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02244 on 25994 degrees of freedom
  (29079 observations deleted due to missingness)
Multiple R-squared:  0.929, Adjusted R-squared:  0.9289
F-statistic: 3.4e+04 on 10 and 25994 DF,  p-value: < 2.2e-16

###############################################################################

Analysis of Variance Table

Response: LenderYield
                                   Df  Sum Sq Mean Sq  F value    Pr(>F)
factor(ProsperRating..numeric.)     6 166.464 27.7440 55111.56 < 2.2e-16 ***
factor(Term)                        2   4.578  2.2889  4546.72 < 2.2e-16 ***
factor(LoanStatus)                  2   0.105  0.0525   104.32 < 2.2e-16 ***
Residuals                       25994  13.086  0.0005
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

IV-B. Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

Since we have the relationship betwen Prosper Rating and other predictors, we can save time upon analyzing loans by looking at Prosper Rating. Prosper Rating is a good measure of risk.

V. Final Plots and Summary

Plot One

At the begining, we have too many variables. Now, we can narrow down to the most three important variables: LoanStatus, ProsperRating(Numeric) and LenderYield.

image [Figure 21]

Description One

The above plot shows that higher risk loans yield more. As an investor, I think the yield for Prosper Rating 7 (AA-lowest risk) is too low. If we invest in these loan, the average yielding is about 6% to 7% annually. So we should focus on loans with rating 4, 5(which is C and B) since the risk of being charged off is less than 25% but the yielding is about 15% to 20%. It means instead of investing in a large amount of money into 1 loan, we limit our risk by divesify into different small loans with rating B and C, the expected return rate could be about 16% to 18%.

Plot Two

image [Figure 22]

Description Two

I have created ListingYear variable to see whether the risk of bad loans are gradually exposure when the lending platform is more mature. And this is proven by the above plot. We can see that in 2009 and 2010, the number of blue dots are still noticable, but the risk of blue dots are dissolved by time when the number of loans increases.

Plot Three

image [Figure 23]

Description Three

There are some risks that cannot be accounted for in this dataset are the health of our economy and business management of Prosper:

Therefore the plot above can be used as an indicator to estimate how Prosper operates. If the proportion of Prosper loans of new quarter drops compared to the same quarter of last year, we can think of stopping our investment into new loans.


Reflection

The dataset of Prosper loan had nearly 114000 loans from Nov 2005 to March 2014. Over the year the number of loans has been increasing. Based on my analysis, I can conclude that Prosper Rating is reliable. It is good indicator of risks and bad loans. Also, we find out that higher risk loans tend to yield more. By March 2014, the lending platform has matured enough to exposure the risk of bad loans. In other words, if we divesify our investments, we can expect higher return with higher risk loans on average. However, there are some unaccountable risks such as economy recession or Prosper’s pooly management. Morever, it is hard to find short term loans as 12-months with higher yeilding rate. Therefore we tend to invest to 36-months or 60-months loans.