CRAIGLIST CAR FOR SALE DATA EXPLORATION¶

The data set contains about vehicles (cars, trucks, vans, ...) for sale on a Web site. Individuals (owners) or dealers create a post for a vehicle, and provide information about it such as the make, model and year of the vehicle, put it in a broad category for condition (e.g., excellent, good, ...), how many miles it has on the odometer, the type of vehicle (e.g. hatchback, sudan), what type of fuel it uses (e.g., gas or diesel). We also have information about the location of the poster/vehicle - the city, and often the latitude and longitude. Importantly, we have the price being sought by the seller. We also have a short free-form description of the vehicle.

I. SETUP:¶

# Load the data file called vehicles.rda from the subdirectory called data.
load("./data/vehicles.rda")

library(lattice)
library(maps)

library(ggplot2)
library(gmodels)
library(RColorBrewer)

# Create ownerDealer variable
vposts$ownerDealer = factor(vposts$byOwner, levels = c(TRUE, FALSE), labels = c("Owner", "Dealer"))

II. DATA EXPLORATION:¶

How many observations are there in the data set?¶

# We can get the number of observations by counting the number of rows
nrow(vposts)

## [1] 34677

# Alternatively, we could get the dimensions of the dataset and look at the number of rows (observations)
dim(vposts)

## [1] 34677    27

What are the names of the variables? and what is the class of each variable?¶

# Names of the variables
names(vposts)

##  [1] "id"           "title"        "body"         "lat"         
##  [5] "long"         "posted"       "updated"      "drive"       
##  [9] "odometer"     "type"         "header"       "condition"   
## [13] "cylinders"    "fuel"         "size"         "transmission"
## [17] "byOwner"      "city"         "time"         "description" 
## [21] "location"     "url"          "price"        "year"        
## [25] "maker"        "makerMethod"  "ownerDealer"

# Get the class of each variable using sapply() and then unlist() the results
unlist( sapply(X = vposts, FUN = class) )

##           id        title         body          lat         long 
##  "character"  "character"  "character"    "numeric"    "numeric" 
##      posted1      posted2     updated1     updated2        drive 
##    "POSIXct"     "POSIXt"    "POSIXct"     "POSIXt"     "factor" 
##     odometer         type       header    condition    cylinders 
##    "integer"     "factor"  "character"     "factor"    "integer" 
##         fuel         size transmission      byOwner         city 
##     "factor"     "factor"     "factor"    "logical"     "factor" 
##        time1        time2  description     location          url 
##    "POSIXct"     "POSIXt"  "character"  "character"  "character" 
##        price         year        maker  makerMethod  ownerDealer 
##    "integer"    "integer"  "character"    "numeric"     "factor"

# Alternative solution that is cleaner as suggested by Ian Heath on Piazza
print.table( sapply(X = vposts, FUN = class) )

##              id           title            body             lat 
##       character       character       character         numeric 
##            long          posted         updated           drive 
##         numeric POSIXct, POSIXt POSIXct, POSIXt          factor 
##        odometer            type          header       condition 
##         integer          factor       character          factor 
##       cylinders            fuel            size    transmission 
##         integer          factor          factor          factor 
##         byOwner            city            time     description 
##         logical          factor POSIXct, POSIXt       character 
##        location             url           price            year 
##       character       character         integer         integer 
##           maker     makerMethod     ownerDealer 
##       character         numeric          factor

What is the average price of all the vehicles? the median price? and the deciles? Displays these on a plot of the distribution of vehicle prices.¶

# The densityplot (with rug plot) indicates that there are a few extreme outliers in the right tail.
densityplot(vposts$price, main = "Price", xlab = "Price")

# Let's see the 50 largest prices
tail( sort(vposts$price), 50 )

##  [1]     95593     96590     97000     97500     97911     98000     99560
##  [8]     99999    100000    100000    100000    100000    104800    105000
## [15]    105500    107000    112000    116100    116491    120000    122950
## [22]    123981    125000    129950    129990    138500    139000    139950
## [29]    143000    143000    143950    147000    149890    149995    150000
## [36]    152900    159000    169000    177588    202455    240000    286763
## [43]    359000    400000    559500    569500   9999999  30002500 600030000
## [50] 600030000

# For sure, the prices 9999999 30002500 600030000 600030000 are highly suspicious.
# Let's look at any car more than $100,000.
idx = which( vposts$price >=  100000 & !is.na(vposts$price) ) 

# There are 42 cars > $100,000
length( idx )

## [1] 42

# Let's order the row indices by price low-to-high
idx = idx[order(vposts[ idx, "price"])]

# Let's look at some of the actual data for those cars > $100,000
vposts[ idx, c("header", "price") ]

    ##                                                       header     price
## posted12461                                 2000 Mack RD688S    100000
## posted24081                           2003 lincoln navigator    100000
## posted181511                                1965 porsche 911    100000
## posted245512                               1961 Maserati 151    100000
## posted212510                               2014 Porsche  911    104800
## posted13245                                 2010 ford fusion    105000
## posted7225                           1967 chevrolet corvette    105500
## posted20867                          2015 mercedes-benz s550    107000
## posted194311                              2005 TOYOTA AVALON    112000
## posted222912                          2015 Porsche  Panamera    116100
## posted37310                      2014 Audi RS 7 4.0T quattro    116491
## posted95215                1988 porsche 911 Carrera Targa TL    120000
## posted18546                           2013 Mercedes-Benz G63    122950
## posted106214                   2014 Land Rover Range 5.0L V8    123981
## posted212613                                     1941 willys    125000
## posted11066                        2009 Lamborghini Gallardo    129950
## posted16934                                   2013 Isuzu NRR    129990
## posted191712                             2015 Hyundai Sonata    138500
## posted112215                                1976 Porsche 930    139000
## posted18506                          2009 Mercedes-Benz SL65    139950
## posted163710                     2015 Mercedes-Benz  S-Class    143000
## posted231215                     2015 Mercedes-Benz  S-Class    143000
## posted214110                           2011 Bentley Mulsanne    143950
## posted238613                                2015 Porsche GT3    147000
## posted5038   2012 Mercedes-Benz SLS AMG 2dr Roadster SLS AMG    149890
## posted129214                2007 Lamborghini Gallardo Spyder    149995
## posted22621                                        2013 Ford    150000
## posted231114                               2015 Porsche  911    152900
## posted220311                                2011 toyota rav4    159000
## posted6747                                    2004 Lexus 470    169000
## posted698                                        2008 BMW X5    177588
## posted121313                                2016 porsche 911    202455
## posted12630                          2014 ferrari 458 italia    240000
## posted9976                               2004 Toyota Corolla    286763
## posted1460                          2010 CHEVROLET SILVERADO    359000
## posted23788                                     2006 FORD GT    400000
## posted21402                            2009 CHEVROLET IMPALA    559500
## posted21422                       2007 CHEVROLET MONTE CARLO    569500
## posted16005                                2001 Honda Accord   9999999
## posted6903                            2002 Caddy Seville sls  30002500
## posted22491                                 1969 Pontiac GTO 600030000
## posted23881                                 1969 Pontiac GTO 600030000

There are some very expensive cars in there like Mercedes-Benz G63 AMG, Bentley Mulsanne, Maserati 3500 GT, Porsche GT, etc. which are probably legitimately more than \$100,000. However, there are also cars in there such as “2015 Hyundai Sonata Call/SMS 650.445.0890 (12) - \$138500” which probably has an extra zero at the end and even a “2007 CHEVROLET MONTE CARLO LT - Easy Financing! Any Credit Auto Loans! BHPH! - \$569500” which probably has two extra zeros at the end. We will ignore them for now, but acknowledge that these are also problems and that they should be cleaned up.

# Let's look at the worst price errors
vposts[ vposts$price >= 9999999 & !is.na(vposts$price), c("header", "body")]

##                             header
## posted22491       1969 Pontiac GTO
## posted23881       1969 Pontiac GTO
## posted6903  2002 Caddy Seville sls
## posted16005      2001 Honda Accord
##                                                                                                                                                                                                                                                                                                                                                                                                                                  body
## posted22491 \n        We have 1968 & 1969 Pontiac GTO's.\nCurrently we are working on a 1968 end a 1969 Gto project is almost    complete.\nOur Intention is the custom to specification by owner.\nCost will be between $6000 & $30,000. This will be depending on the     car in the condition and the Owner financial capabilities. \nSerious inquires only inquiries only.. please call Tony at \n show contact     info\n\n    
## posted23881 \n        We have 1968 & 1969 Pontiac GTO's.\nCurrently we are working on a 1968 end a 1969 Gto project is almost    complete.\nOur Intention is the custom to specification by owner.\nCost will be between $6000 & $30,000. This will be depending on the     car in the condition and the Owner financial capabilities. \nSerious inquires only inquiries only.. please call Tony at \n show contact     info\n\n    
## posted6903                                                                                                                                                                                                                    \n        clean, fully loaded, nice shine, good    running engine and trans, willing to trade for old school or truck?????????????????? Mounted on 22 inch rims new tires no bends no     cracks\n    
## posted16005                                                                                                                                                                                                                                                                                                                        \n        Selling my car for some lunch money. $20 OBO. Comes with  complimentary Oboe.\n

The two records with price of 600030000 are 1968 and 1969 Pontiac GTO - \$600030000, which can see from reading the body of the post that these are offers to customize GTO’s for somewhere between \$6,000 and \$30,000 and these values seem to be pasted together. We can revisit this in question #8, but for now, let’s exclude these.

Best guess for the Seville SLS with price 30002500 looks like a typo by the person who posted the ad. Perhaps they were originally thinking that they would sell it for \$3,000, but then changed their mind and lowered the price to \$2,500 but did not properly clear the field for price before they editing the posting price. We can revisit this in question #8, but for now, let’s exclude this.

The 2001 Honda Accord with price 9999999 looks like a deliberate misleading price by the person who posted the ad given that they failed to fill in several other fields.

As a side note, numbers like 99, or 999 or 99999999, etc. are often default entries for missing values, much like NA, #N/A, , Null, NULL, etc. Depending on the data, even numbers like -1 could be defined as missing data. We can revisit this in question #8, but for now, let’s exclude this.

For Question #3 here, all cars with a price greater than \$240,000 seem to have data errors so we will exclude all of them here.

We also note several other problems like lots of missing prices in general. These could be corrected one-by-one perhaps.

There are also lots of cars for sale at the price of \$1. This is a common advertising tactic to post for the minimum price since most people sort prices lowest-to-highest and thus these ads get seen at the top more often. Most of these are misleading ads by dealers, some are for car parts, some are offers for car financing. The same holds for almost all ads less than \$500. There is just too much data to clean up manually here so we will exclude them. SURELY we are throwing away some good data here and biasing our results.

# How many prices are NA?
sum( is.na(vposts$price) )

## [1] 3328

# How many cars for sale at $1?
sum( vposts$price == 1 & !is.na(vposts$price) )

## [1] 612

# Let's take a look 10% randomly selected $1 cars
idx = which( vposts$price == 1 & !is.na(vposts$price) )
idx = sample(x = idx, size = 60, replace = FALSE)

# Omitted here for brevity
# vposts[ idx, "body"]

# Only include prices between $500 and $240,000
idx = (vposts$price >= 500 & vposts$price <= 240000 & !is.na(vposts$price))

densityplot(vposts$price[ idx ], xlab = "Price", main = "Price")

# 94% of the data is between about $500 and $50,000
quantile(x = vposts$price, probs = c(0.05,0.99), na.rm = TRUE)

##    5%   99% 
##   499 47000

# Final plot
avg = mean( vposts$price[ idx ] )
med = median( vposts$price[ idx ] )
dec = quantile(x = vposts$price[ idx ], probs = seq(from = 0.1, to = 0.9, by = 0.1) )

idx = (vposts$price >= 500 & vposts$price <= 50000 & !is.na(vposts$price))

plot(density(vposts$price[ idx ]), xlab = "Price", main = "Price", lwd = 2)
abline(h = 0)
points(x = c(avg, med), y = c(0,0), col = c("forestgreen", "darkorange"), pch = 18, cex = 3.0)
points(x = dec, y = rep(x = 0, times = length(dec)), col = "blue2", pch = 18, cex = 1.5)
legend("topright", legend = c("Mean", "Median", "Deciles"), col = c("forestgreen", "darkorange", "blue2"), 
       pch = 18)

What are the different categories of vehicles, i.e. the type variable/column? What is the proportion for each category ?¶

# Different types.  Note the NA's
names( table(vposts$type, useNA = "ifany") )

##  [1] "bus"         "convertible" "coupe"       "hatchback"   "mini-van"   
##  [6] "offroad"     "other"       "pickup"      "sedan"       "SUV"        
## [11] "truck"       "van"         "wagon"       NA

# Proportions table, sorted
sort( round( x = prop.table( x = table(vposts$type, useNA = "ifany") ), digits = 4) )

## 
##         bus     offroad    mini-van         van       wagon       other 
##      0.0006      0.0019      0.0131      0.0146      0.0161      0.0192 
## convertible   hatchback      pickup       truck       coupe         SUV 
##      0.0204      0.0236      0.0262      0.0347      0.0469      0.1214 
##       sedan        <NA> 
##      0.2030      0.4583

# Let's plot the proportions
# Base dotchart() will not show the label NA and Lattice dotplot() will not plot the NA at all so force it
t = prop.table( x = table(vposts$type, useNA = "ifany") )
names(t)[ is.na(names(t)) ] = "NA"

# dotchart(x = sort(t), xlim = c(0,1), main = "Proportions of Car Types")

dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5, main = "Proportions of Car Types")

# Let's look at proportions *among* the non-missing
t = prop.table( x = table(vposts$type[ !is.na(vposts$type) ], useNA = "ifany") )
dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5, main = "Proportions of Car Types Among Non-Missing")

Close to half of the data is missing the vehicle type. Among those not missing the vehicle type, about 40% are sedans.

Display the relationship between fuel type and vehicle type. Does this depend on transmission type?¶

What we can see from the mosaic plots below overall and by transmission type and ignoring vehicle type other and fuel type other, is that gas vehicles dominate across vehicle types and across transmission types, with the notable exception that trucks have higher percent diesel than other vehicle types, as do buses with automatic transmissions - this is expected.

It might be easier to see these same relationships in dotplots than mosaicplots.

# Overall relationship fuel type and vehicle type
tbl = table(vposts$fuel, vposts$type)
row.order = order( rowSums( tbl ), decreasing = TRUE )
col.order = order( colSums( tbl ), decreasing = TRUE )
tbl = tbl[ row.order, col.order ]

col.palette = colorRampPalette(brewer.pal(9,"Blues"))(length(col.order)) 

mosaicplot(tbl, las = 2, color = col.palette, 
           main =  "Overall Fuel Type by Vehicle Type", cex = 0.7
)

# Split by transmission type
fuelVehicleBytransmission = split( vposts[ , c("fuel", "type")], vposts$transmission)

invisible(
  lapply( 1:length(fuelVehicleBytransmission), 
          FUN = function(x){
            tbl = table(fuelVehicleBytransmission[[x]]$fuel, fuelVehicleBytransmission[[x]]$type)

            row.order = order( rowSums( tbl ), decreasing = TRUE )
            col.order = order( colSums( tbl ), decreasing = TRUE )
            tbl = tbl[ row.order, col.order ]

            # Get color palette and then reverse the order darkest to lightest
            col.palette = colorRampPalette(brewer.pal(9,"Blues"))(length(col.order)) 
            col.palette = col.palette[ length(col.palette):1 ]

            mosaicplot(tbl, las = 2, color = col.palette, 
                       main = paste0("Fuel and Vehicle by Transmission = ",
                                     names(fuelVehicleBytransmission)[x]), cex = 0.7
            )
          }
  )
)

dotplot( 
  prop.table( table(vposts$type , vposts$transmission,  vposts$fuel, useNA = "ifany"),
                     margin = c(1,2) ), 
  xlim = c(-0.05,1.05), auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16),
  xlab = "Percent"
)

How many different cities are represented in the dataset?¶

# levels() gives us the cities
levels(vposts$city)

## [1] "boston"   "chicago"  "denver"   "lasvegas" "nyc"      "sac"     
## [7] "sfbay"

# length() tells us how many there are
length( levels(vposts$city) )

## [1] 7

Visually display how the number/proportion of “for sale by owner” and “for sale by dealer” varies across city?¶

It is a bit suspicious that all cities have roughly 5000 observations and that the percentages are almost perfectly 50/50 within each city. Smells like the data was filtered first…

Note that we also created a new variable called ownerDealer in the setup section.

# Each city has ~5000 observations
table(vposts$city)

## 
##   boston  chicago   denver lasvegas      nyc      sac    sfbay 
##     4958     4886     4979     4963     4983     4966     4942

# prop.table() with margin = 2 gives the column percentages
prop.table(table(vposts$ownerDealer, vposts$city), margin = 2)

##         
##             boston   chicago    denver  lasvegas       nyc       sac
##   Owner  0.4975797 0.4901760 0.4994979 0.4984888 0.4992976 0.5000000
##   Dealer 0.5024203 0.5098240 0.5005021 0.5015112 0.5007024 0.5000000
##         
##              sfbay
##   Owner  0.4991906
##   Dealer 0.5008094

# Plot the counts from table() (Note: almost identical graphs as percentages)
plot( table(vposts$city, vposts$ownerDealer, useNA = "ifany"), main = "Owner/Dealer by City",
      xlab = "City", ylab = "Owner or Dealer", col=c("darkblue","darkviolet"))

# Plot the percentages from prop.table (Note the switch of the variables)
plot( prop.table(table(vposts$city, vposts$ownerDealer), margin = 2), main = "Owner/Dealer by City",
      xlab = "City", ylab = "Owner or Dealer", col=c("darkblue","darkviolet"))

# Use a dotplot since there are only two categories
tbl = prop.table( table(vposts$ownerDealer, vposts$city, useNA = "ifany"), margin = 2)

dotplot( sort(tbl[ 1, ]), xlim = c(-0.05, 1.05), xlab = "% Owner for Sale by Owner", ylab = "City",
         main = "Percent for Sale by Owner by City", cex = 1.5)

The barplots basically capture the key information, but notice that the bar widths do not really represent anything meaningful. Since we are interested in the percentage for sale by owner within each city, a dotplot probably captures this the best since there are only two categories (byOwner = TRUE = owner, and byOwner = FALSE = dealer) and not missing data.

We can see from the tables and very clearly from the plots that the percentages of owner posting and dealer postings are almost perfectly 50/50 and this does not seem to vary across the different cities at all.

What is the largest price for a vehicle in this data set? Examine this and fix the value. Now examine the new highest value for price.¶

# Being pedantic, let's get the max price
max(vposts$price, na.rm = TRUE)

## [1] 600030000

# Let's remind ourselves of the few worst prices
idx = which( vposts$price >=  9999999 & !is.na(vposts$price) ) 

# Let's look at the actual data for those cars > $100,000
vposts[ idx, c("header", "price", "maker", "year") ]

##                             header     price    maker year
## posted22491       1969 Pontiac GTO 600030000  pontiac 1969
## posted23881       1969 Pontiac GTO 600030000  pontiac 1969
## posted6903  2002 Caddy Seville sls  30002500 cadillac 2002
## posted16005      2001 Honda Accord   9999999    honda 2001

# Let's fix the two records for the 1969 Pontiac GTO with price = 600030000
# We could remove it completely - maybe not the best approach.
# We could average the prices from the description, $6000 and $30,000 = $18,000 - not bad approach.
# We could research this car online and try to find some average price - pretty good approach.
# Let's see if we can't find a decent point estimate from within the dataset itself:
idx = ( vposts$maker == "pontiac" & vposts$year %in% c(1968, 1969) & 
          vposts$price < 9999999 & vposts$price > 1 &
          grepl(pattern = "GTO", x = vposts$header, ignore.case = TRUE) &
          !is.na(vposts$maker) & !is.na(vposts$price) & !is.na(vposts$header) )

dat = vposts[ idx, c("header", "price", "maker", "year") ]
dat[ order(dat$price), ]

##                        header price   maker year
## posted4991   1968 pontiac gto 15995 pontiac 1968
## posted5371   1968 pontiac gto 15995 pontiac 1968
## posted231214 1968 Pontiac GTO 24500 pontiac 1968
## posted16497  1969 Pontiac GTO 25000 pontiac 1969
## posted201111 1969 Pontiac GTO 25000 pontiac 1969
## posted7355   1968 pontiac gto 30000 pontiac 1968
## posted16701  1968 Pontiac gto 38500 pontiac 1968
## posted40911          1968 GTO 38500 pontiac 1968

# Let's use a rounded mean price
newPrice = round( mean(vposts$price[idx]), digits = -3)
vposts$price[vposts$price == 600030000 & !is.na(vposts$price)] = newPrice


# Now let's fix the 2002 Caddy Seville sls with price = 30002500
# Let's see if we can't find a decent point estimate from within the dataset itself:
idx = ( vposts$maker == "cadillac" & vposts$year %in% c(2002) & 
          vposts$price < 9999999 & vposts$price > 1 &
          grepl(pattern = "Seville", x = vposts$header, ignore.case = TRUE) &
          !is.na(vposts$maker) & !is.na(vposts$price) & !is.na(vposts$header) )

dat = vposts[ idx, c("header", "price", "maker", "year") ]
dat[ order(dat$price), ]

##                            header price    maker year
## posted20963 2002 cadillac seville   700 cadillac 2002
## posted20691 2002 Cadillac Seville  1500 cadillac 2002
## posted9323  2002 cadillac seville  2100 cadillac 2002
## posted16927 2002 cadillac seville  3495 cadillac 2002

# Average price estimate is lower than the posting price of $2500 or $3000, so use the lower $2500
round( mean(vposts$price[idx]), digits = -3)

## [1] 2000

vposts$price[vposts$price == 30002500 & !is.na(vposts$price)] = 2500

# The 2001 Honda Accord with price = 9999999 is a junk post 
# "Selling my car for some lunch money. $20 OBO. Comes with complimentary Oboe."
# Remove it completely.
idx = which(vposts$price == 9999999 & !is.na(vposts$price))
vposts = vposts[ -idx, ]

Again, SURELY there are more that need to be fixed, but that was enough pain for one day.

What are the three most common makes of cars in each city for “sale by owner” and for “sale by dealer”? Are they similar or quite different?¶

cities = levels(vposts$city)

# We could do the inner function in one line, but it is hard to read so break it down into steps.
#      names( head( sort( table(vposts$maker[ vposts$city == x & vposts$byOwner == y & !is.na(vposts$city) ]),
#             decreasing = TRUE), 3) )
makeByCityByOwner = lapply(X = c(TRUE, FALSE), FUN = function(y){
  sapply(X = cities, FUN = function(x){
    t = table(vposts$maker[ vposts$city == x & vposts$byOwner == y & !is.na(vposts$city) ])
    s = sort(t, decreasing = TRUE)
    h = head(s, 3)
    n = names(h)
  })
})

names(makeByCityByOwner) = c("Owner", "Dealer")

makeByCityByOwner

## $Owner
##      boston      chicago     denver      lasvegas    nyc      sac        
## [1,] "ford"      "chevrolet" "ford"      "ford"      "nissan" "toyota"   
## [2,] "honda"     "ford"      "chevrolet" "chevrolet" "toyota" "ford"     
## [3,] "chevrolet" "honda"     "toyota"    "toyota"    "honda"  "chevrolet"
##      sfbay   
## [1,] "toyota"
## [2,] "honda" 
## [3,] "ford"  
## 
## $Dealer
##      boston      chicago     denver      lasvegas    nyc      sac        
## [1,] "ford"      "chevrolet" "ford"      "ford"      "nissan" "ford"     
## [2,] "toyota"    "ford"      "chevrolet" "nissan"    "toyota" "toyota"   
## [3,] "chevrolet" "nissan"    "dodge"     "chevrolet" "honda"  "chevrolet"
##      sfbay   
## [1,] "toyota"
## [2,] "ford"  
## [3,] "bmw"

# Check if the top in each city by owner and by dealer match
makeByCityByOwner$Owner[ 1, ] == makeByCityByOwner$Dealer[ 1, ]

##   boston  chicago   denver lasvegas      nyc      sac    sfbay 
##     TRUE     TRUE     TRUE     TRUE     TRUE    FALSE     TRUE

A quick visual scan shows that 2 of the top 3 makes by city for sale by owner are among the top 3 for sale by dealer within the same city. The top in each city by owner is the same as by dealer in all cities except SacTown. They are fairly similar.

Visually compare the distribution of the age of cars for different cities and for “sale by owner” and “sale by dealer”. Provide an interpretation of the plots, i.e., what are the key conclusions and insights?¶

There are a few clear data errors with years in c(4, 1900, 2022), potentially others as well. We could fix them, or just ignore them since there are only a few data points that will not affect our analysis much.

The 2022 Honda Odyssey “has only 117102 miles” so it is probably a typo for 2002, so let’s fix it that way.

The year = 1900 ads are for wheels and buying back cars that fail the smog test so they are not cars at all. Remove them.

The Jeep with year = 4 is probably 2004 since it has a “AM/FM cassette player-muli CD player”, so let’s fix it that way.

# Omitted here for brevity
# vposts[ vposts$year %in% c(4, 1900, 2022) & !is.na(vposts$year), ]

# Clean up
vposts$year[ vposts$year == 2022 & !is.na(vposts$year) ] = 2002
vposts = vposts[ -which(vposts$year == 1900 & !is.na(vposts$year)), ]
vposts$year[ vposts$year == 4 & !is.na(vposts$year) ] = 2004

# Create an age variable
vposts$age = 2016 - vposts$year

# Overall by owner has older cars, dealers have newer cars
histogram( ~ age | byOwner, data = vposts, main = "Age by Owner/Dealer", col = "lightblue")

# Subset to zoom in a bit on the plots
idx = ( vposts$age < 25 & !is.na(vposts$age))

# Zoomed in
histogram( ~ age | byOwner, data = vposts, subset = idx, main = "Age by Owner/Dealer Zoomed",
           col = "lightblue")

# Looking by city, there is not much difference at all across cities for owner vs. dealer
histogram( ~ age | byOwner + city, data = vposts, subset = idx, main = "Age by City by Owner/Dealer Zoomed",
           col = "lightblue")

It does seem that cars for sale by owners tend to be older than cars for sale by dealers. However, this does not seem to vary much by city.

Plot the locations of the posts on a map? What do you notice?¶

Remember that the data was limited to just 7 major cities. Thus it would be a mistake to conclude something like almost all used cars on Craigslist (or from whatever website this data comes) sell in these 7 cities. We can conclude that the location of the people (and/or the cars themselves) who sell used cars in these major cities tend to be clustered fairly tightly around the major cities. Most buyers may be unlikely to travel far to look at, let alone, buy a used car.

There could be several explanations for the points far from one of the major cities. They could be travelling when they actaully posted the ad, for example. But in general, the location of the person posting the car is generally the same as the city in which they are attempting to sell the vehicle.

We also notice in the second plot of California that it seems that many people in Sac Town post their car on SF Bay Area, which we can see by the SF Bay points bleeding into the Sacramento area on the map. Perhaps they are trying to reach a bigger audience in the Bay.

We also notice that there are a few people in Reno / Tahoe posting ads in Sacramento and SF Bay. And there also appears to be somebody posting off the coast of Monterrey, but perhaps that is a data error :)

We could make one more improvement to this plot by using an alpha parameter to control the transparancy of the plotting points to better see density and bleeding into other regions.

# Split the data by city to see if location of poster is clustered around posting city
locationByCity = split(vposts[ , c("long", "lat")], vposts$city)

# Get color palette and then reverse the order darkest to lightest
col.palette = brewer.pal( length(locationByCity), "Purples")
col.palette = col.palette[ length(col.palette):1 ]

map('state', mar = c(0,0,0,0))
invisible( 
  lapply( 1:length(locationByCity), 
          FUN = function(x){
            points(x = locationByCity[[ x ]]$long, y = locationByCity[[ x ]]$lat, 
            pch = ".", cex = 5, col = col.palette[x]  )
        }
  ) 
)
legend("bottomleft", legend = names(locationByCity), col = col.palette, pch = 15, cex = 0.9)

map('state', region = 'california', mar = c(0,0,0,0))

points(x = locationByCity[[ "sac" ]]$long, y = locationByCity[[ "sac" ]]$lat,
       pch = ".", cex = 4, col = col.palette[1] )

points(x = locationByCity[[ "sfbay" ]]$long, y = locationByCity[[ "sfbay" ]]$lat,
       pch = ".", cex = 4, col = col.palette[5] )

# reset margins
par(mar = c(5, 4, 4, 2) + 0.1 )

Summarize the distribution of fuel type, drive, transmission, and vehicle type. Find a good way to display this information.¶

Notice that in the dotplot below, that the distributions in the different panels are all almost identical excpet that the distribtion shows some variation in the middle column where fuel type = "gas". Thus, we can essentially drop the fuel type from the plot, subset to just fuel type = "gas" and consider the relationship among the remaining three variables.

# 9 panels;  circle colors are the vehicle type
# dotplot( table(vposts$fuel, vposts$drive, vposts$transmission, vposts$type))

# 15 panels
dotplot( table(vposts$type, vposts$fuel, vposts$drive, vposts$transmission), 
         auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16) )

# fuel = "gas" for almost all the data
table(vposts$fuel, useNA = "ifany")

## 
##   diesel electric      gas   hybrid    other     <NA> 
##     1071       68    29222      350     1187     2771

Subsetting to just fuel == "gas" we see that automatic transmissions are the most common across all types of cars and across drive trains, followed by manual then other. Within the rear wheel drive vehicles, manual stick shifts do have a higher percentage than other verhicle types for hatchback, coupe and convertible, which makes sense since coupe and convertibles tend to be sports cars and hatchbacks tend to be cheaper, entry level models. Among 4 wheel ddrive, offroad vehicles have a higher

idx = (vposts$fuel == "gas" & !is.na(vposts$fuel))

dotplot( 
  prop.table( table(vposts$type[idx], vposts$drive[idx], vposts$transmission[idx], useNA = "ifany"),
                     margin = c(1,2) ), 
  xlim = c(-0.05,1.05), auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16),
  xlab = "Percent"
)

Previously, we fixed some prices and we also concluded that we would subset price to those prices >= 500 and <= 50,000, which captures about 94% of the data.

We should spend some time cleaning up the odometer readings. For example, the max odometer reading of 1234567890 is just some annoying poster. But to keep things simple here, we see that 99th percentile of odometer readings is 2.610^{5}, thus we will trim the data at 500,000 to capture almost all of the distribtion.

As we can see from the smoothScatter plot below, there generally does seem to be an upward trend for the overwhelming majority of the data. However, notice the dense shading between about 5 years of age and 20 years of age that have low odometer readings. Also notice that once age of the vehicle gets beyond about 25 years or so that there is actually a downward trend in that subset of the data. This is likely because very old antique cars just sit in some old man’s garage getting polished and buffed and only taken out for a ride during antique car shows.

quantile(vposts$odometer, probs = 0.99, na.rm = TRUE)

##    99% 
## 260000

idx = ( vposts$odometer < 500000 & !is.na(vposts$odometer) & !is.na(vposts$age) )
smoothScatter(x = vposts$age[idx], y = vposts$odometer[idx], xlab = "Age of Vehicle", 
              ylab = "Odometer Reading", main = "Odometer Readings by Age of Vehicle" )

As we can see in the smoothscatter plot below, there is a general negative relationship between odometer reading and price, but note that there are some very expensive cars with low odometer readings, many of those are antique cars.

idx = ( vposts$odometer < 500000 & vposts$price >= 500 & vposts$price <= 100000 &
          !is.na(vposts$odometer) & !is.na(vposts$price) )
smoothScatter(x = vposts$odometer[idx], y = vposts$price[idx], xlab = "Odometer Reading", 
              ylab = "Price of Vehicle", main = "Price by Odometer Readings of Vehicle" )

Identify the “old” cars. What manufacturers made these? What is the price distribution for these?¶

From the first smoothScatter plot below, it would seem that cars older than say 35 years are “old.”

From the table below, Chevy and Ford make up more than 50% of the “old” cars and contains most of the classic American car brands, as one might expect. In particular, there are not many “old” Japanese cars since the U.S. did not start importing Japanese cars on a large scale until the oil shocks of the 1970’s, whereas our cutoff for “old” is approximately 1970.

Comparing the price distribution of “old” cars compared to all cars, “old” cars seem have more density at higher prices with the majority of the prices less than \$40,000 whereas the bulk of the overall data tends to be less than \$20,000.

# Check overall age and price relationship
idx = (vposts$price >= 500 & vposts$price <= 100000 &
          !is.na(vposts$price) & !is.na(vposts$age) )

smoothScatter(x = vposts$age[idx], y = vposts$price[idx], xlab = "Age of Vehicle", 
              ylab = "Price of Vehicle", main = "Price by Age of Vehicle" )

# Look at makers and price distribution for "old" cars
idx = (vposts$age >= 35 & !is.na(vposts$age))

# Look at the 10 largest makers of "old" cars
round( sort( tail( sort(table(vposts$maker[ idx ])) / sum(table(vposts$maker[ idx ])), 10)
             , decreasing = TRUE), digits = 3 )

## 
##  chevrolet       ford volkswagen      dodge    pontiac   cadillac 
##      0.315      0.213      0.055      0.038      0.038      0.037 
##      buick        gmc   plymouth oldsmobile 
##      0.027      0.026      0.025      0.024

idx = (vposts$age >= 35 & !is.na(vposts$age) & !is.na(vposts$price))
smoothScatter(x = vposts$age[idx], y = vposts$price[idx], xlab = "Age of Vehicle", 
              ylab = "Vehicle Price", main = "Price by Age of Vehicle" )

I have omitted one important variable in this data set. What do you think it is? Can we derive this from the other variables? If so, sketch possible ideas as to how we would compute this variable.¶

When searching for cars on websites, the most important filters are usually year, make and model, in that order. Notice that year and make (i.e. maker) are stand-alone variables in the dataset, but model is not. However, notice that variable called header in the dataset is year, make, then model. Thus if we could parse the text string of each header to pull out the model, we could derive our own stand-alone variable for model. In fact, a future homework assignment will focus on using regular expressions to do just this type of string parsing.

head(vposts$header, 20)

##  [1] "2012 Chevrolet Camaro SS"                   
##  [2] "2013 Chevrolet Equinox LT"                  
##  [3] "2013 Nissan Altima 2.5 SV"                  
##  [4] "2009 Infiniti M35x X"                       
##  [5] "2013 Infiniti G37x X"                       
##  [6] "2012 Acura MDX 3.7L"                        
##  [7] "2010 Toyota Yaris"                          
##  [8] "2012 Acura RDX Base"                        
##  [9] "2014 Chevrolet Cruze 1LT Auto"              
## [10] "2009 Lexus IS 250"                          
## [11] "2013 Toyota Camry SE"                       
## [12] "2014 Honda CR-Z"                            
## [13] "2008 BMW 335 xi"                            
## [14] "2013 Nissan Juke SL"                        
## [15] "2013 Dodge Durango SXT"                     
## [16] "2012 Toyota Camry 4dr Sedan I4 Automatic SE"
## [17] "2012 Toyota Venza LE"                       
## [18] "2014 Nissan Juke S"                         
## [19] "2012 Toyota Corolla"                        
## [20] "2012 Honda Pilot EX-L"

# Print out conditions so we can cut and paste them into smaller
# categories. There's really no way out of this since a human has to decide
# what the new categories should be.
conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
 cat(conditions)

## "0used",
##  "207,400",
##  "ac/heater",
##  "carfax guarantee!!",
##  "certified",
##  "complete parts car, blown engine",
##  "excellent",
##  "fair",
##  "front side damage",
##  "good",
##  "hit and run :( gently",
##  "honnda",
##  "like new",
##  "mint",
##  "muscle car restore",
##  "needs bodywork",
##  "needs restoration!",
##  "needs restored",
##  "needs total restore",
##  "needs work",
##  "needs work/for parts",
##  "new",
##  "nice",
##  "nice rolling restoration",
##  "nice teuck",
##  "not running",
##  "parts",
##  "pre owned",
##  "pre-owned",
##  "preowned",
##  "preownes",
##  "project",
##  "project car",
##  "rebuildable project",
##  "restoration",
##  "restoration project",
##  "restore",
##  "restored",
##  "rough but runs",
##  "salvage",
##  "superb original",
##  "used",
##  "very good",

# We'll base the new categories on the most common existing categories.
sort(table(vposts$condition))

## 
##                            0used                          207,400 
##                                1                                1 
##                        ac/heater complete parts car, blown engine 
##                                1                                1 
##                front side damage            hit and run :( gently 
##                                1                                1 
##                           honnda                             mint 
##                                1                                1 
##               muscle car restore               needs restoration! 
##                                1                                1 
##              needs total restore                       needs work 
##                                1                                1 
##             needs work/for parts                             nice 
##                                1                                1 
##         nice rolling restoration                       nice teuck 
##                                1                                1 
##                      not running                        pre-owned 
##                                1                                1 
##                         preownes                      project car 
##                                1                                1 
##              rebuildable project                      restoration 
##                                1                                1 
##              restoration project                          restore 
##                                1                                1 
##                         restored                   rough but runs 
##                                1                                1 
##               carfax guarantee!!                   needs restored 
##                                2                                2 
##                        pre owned                  superb original 
##                                2                                2 
##                   needs bodywork                          project 
##                                3                                3 
##                            parts                         preowned 
##                                5                               10 
##                        very good                        certified 
##                               10                               54 
##                          salvage                              new 
##                               61                              273 
##                             fair                             used 
##                              770                             1180 
##                         like new                             good 
##                             2898                             4663 
##                        excellent 
##                             7543

# We'll base the new categories on the most common existing categories.
sort(table(vposts$condition))

## 
##                            0used                          207,400 
##                                1                                1 
##                        ac/heater complete parts car, blown engine 
##                                1                                1 
##                front side damage            hit and run :( gently 
##                                1                                1 
##                           honnda                             mint 
##                                1                                1 
##               muscle car restore               needs restoration! 
##                                1                                1 
##              needs total restore                       needs work 
##                                1                                1 
##             needs work/for parts                             nice 
##                                1                                1 
##         nice rolling restoration                       nice teuck 
##                                1                                1 
##                      not running                        pre-owned 
##                                1                                1 
##                         preownes                      project car 
##                                1                                1 
##              rebuildable project                      restoration 
##                                1                                1 
##              restoration project                          restore 
##                                1                                1 
##                         restored                   rough but runs 
##                                1                                1 
##               carfax guarantee!!                   needs restored 
##                                2                                2 
##                        pre owned                  superb original 
##                                2                                2 
##                   needs bodywork                          project 
##                                3                                3 
##                            parts                         preowned 
##                                5                               10 
##                        very good                        certified 
##                               10                               54 
##                          salvage                              new 
##                               61                              273 
##                             fair                             used 
##                              770                             1180 
##                         like new                             good 
##                             2898                             4663 
##                        excellent 
##                             7543

# Define new categories.
new_cats = list(
  excellent = c("excellent"),
  good = c("good", "very good"),
  "like new" = c("like new", "mint", "new", "pre owned", "pre-owned", "preowned", "preownes"),
  used = c("0used", "used"),
  fair = c("fair", "nice", "nice teuck"),
  salvage = c("complete parts car, blown engine", "front side damage", "hit and run :( gently", 
              "muscle car restore", "needs bodywork", "needs restoration!", "needs restored",
              "needs total restore", "needs work", "needs work/for parts", "nice rolling restoration",
              "not running", "parts", "project", "project car", "rebuildable project", "restoration",
              "restoration project", "restore", "restored", "salvage", "rough but runs"),
  other = c("207,400", "ac/heater", "carfax guarantee!!", "certified", "honnda", "superb original" )
)

# Convert conditions to new categories.
vposts$new_cond = vposts$condition
levels(vposts$new_cond) = c(levels(vposts$new_cond), "other")

for (i in seq_along(new_cats)) {
  new_cat = names(new_cats)[[i]]
  vposts$new_cond[vposts$new_cond %in% new_cats[[i]]] = new_cat
}

vposts$new_cond = factor(vposts$new_cond)

# Restrict to odometer less than 5 million.
sane_odo = subset(vposts, odometer < 5e6)
odo_by_cond = split(sane_odo$odometer, sane_odo$new_cond)

boxplot(odo_by_cond, col = "salmon")
title("Odometer Distribution by Condition", ylab = "Miles")

# Make a second plot to show the distribution better.
boxplot(odo_by_cond, col = "skyblue1", ylim = c(0, 5e5))
title("Odometer Distribution by Condition", ylab = "Miles")

# Now we can see that the highest odometer readings seem to be for the "fair"
# and "good" conditions, which is a little surprising. It's possible people
# overstate the condition of the car when the odometer is higher, to try to
# make it sound more appealing. The condition with the most spread-out
# distribution is "salvage," which makes sense because salvage cars could be
# very old, or they could be new cars that were damaged somehow.

sane_price = subset(vposts, price < 2e5)
price_by_cond = split(sane_price$price, sane_price$new_cond)
boxplot(price_by_cond, col = "thistle")
title("Price Distribution by Condition", ylab = "Dollars")

# Make an age column.
vposts$age = 2015 - vposts$year

sane_age = subset(vposts, age < 200)
age_by_cond = split(sane_age$age, sane_age$new_cond)
boxplot(age_by_cond, col = "aquamarine")
title("Age Distribution by Condition", ylab = "Years")

# The price and age distributions don't show anything too surprising. Price and
# condition appear to be directly related, with salvage cars having the lowest
# prices. Cars that are "like new" or "excellent" are sometimes offered for
# extremely high prices, whereas this is less common with cars in worse
# condition. Age and condition, are inversely related: older cars seem to have
# worse conditions. As with odometer and condition, salvage cars exhibit the
# weakest relationship.