The data set contains about vehicles (cars, trucks, vans, ...) for sale on a Web site. Individuals (owners) or dealers create a post for a vehicle, and provide information about it such as the make, model and year of the vehicle, put it in a broad category for condition (e.g., excellent, good, ...), how many miles it has on the odometer, the type of vehicle (e.g. hatchback, sudan), what type of fuel it uses (e.g., gas or diesel). We also have information about the location of the poster/vehicle - the city, and often the latitude and longitude. Importantly, we have the price being sought by the seller. We also have a short free-form description of the vehicle.
# Load the data file called vehicles.rda from the subdirectory called data.
load("./data/vehicles.rda")
library(lattice)
library(maps)
library(ggplot2)
library(gmodels)
library(RColorBrewer)
# Create ownerDealer variable
vposts$ownerDealer = factor(vposts$byOwner, levels = c(TRUE, FALSE), labels = c("Owner", "Dealer"))
# We can get the number of observations by counting the number of rows
nrow(vposts)
## [1] 34677
# Alternatively, we could get the dimensions of the dataset and look at the number of rows (observations)
dim(vposts)
## [1] 34677 27
# Names of the variables
names(vposts)
## [1] "id" "title" "body" "lat"
## [5] "long" "posted" "updated" "drive"
## [9] "odometer" "type" "header" "condition"
## [13] "cylinders" "fuel" "size" "transmission"
## [17] "byOwner" "city" "time" "description"
## [21] "location" "url" "price" "year"
## [25] "maker" "makerMethod" "ownerDealer"
# Get the class of each variable using sapply() and then unlist() the results
unlist( sapply(X = vposts, FUN = class) )
## id title body lat long
## "character" "character" "character" "numeric" "numeric"
## posted1 posted2 updated1 updated2 drive
## "POSIXct" "POSIXt" "POSIXct" "POSIXt" "factor"
## odometer type header condition cylinders
## "integer" "factor" "character" "factor" "integer"
## fuel size transmission byOwner city
## "factor" "factor" "factor" "logical" "factor"
## time1 time2 description location url
## "POSIXct" "POSIXt" "character" "character" "character"
## price year maker makerMethod ownerDealer
## "integer" "integer" "character" "numeric" "factor"
# Alternative solution that is cleaner as suggested by Ian Heath on Piazza
print.table( sapply(X = vposts, FUN = class) )
## id title body lat
## character character character numeric
## long posted updated drive
## numeric POSIXct, POSIXt POSIXct, POSIXt factor
## odometer type header condition
## integer factor character factor
## cylinders fuel size transmission
## integer factor factor factor
## byOwner city time description
## logical factor POSIXct, POSIXt character
## location url price year
## character character integer integer
## maker makerMethod ownerDealer
## character numeric factor
# The densityplot (with rug plot) indicates that there are a few extreme outliers in the right tail.
densityplot(vposts$price, main = "Price", xlab = "Price")
# Let's see the 50 largest prices
tail( sort(vposts$price), 50 )
## [1] 95593 96590 97000 97500 97911 98000 99560
## [8] 99999 100000 100000 100000 100000 104800 105000
## [15] 105500 107000 112000 116100 116491 120000 122950
## [22] 123981 125000 129950 129990 138500 139000 139950
## [29] 143000 143000 143950 147000 149890 149995 150000
## [36] 152900 159000 169000 177588 202455 240000 286763
## [43] 359000 400000 559500 569500 9999999 30002500 600030000
## [50] 600030000
# For sure, the prices 9999999 30002500 600030000 600030000 are highly suspicious.
# Let's look at any car more than $100,000.
idx = which( vposts$price >= 100000 & !is.na(vposts$price) )
# There are 42 cars > $100,000
length( idx )
## [1] 42
# Let's order the row indices by price low-to-high
idx = idx[order(vposts[ idx, "price"])]
# Let's look at some of the actual data for those cars > $100,000
vposts[ idx, c("header", "price") ]
## header price
## posted12461 2000 Mack RD688S 100000
## posted24081 2003 lincoln navigator 100000
## posted181511 1965 porsche 911 100000
## posted245512 1961 Maserati 151 100000
## posted212510 2014 Porsche 911 104800
## posted13245 2010 ford fusion 105000
## posted7225 1967 chevrolet corvette 105500
## posted20867 2015 mercedes-benz s550 107000
## posted194311 2005 TOYOTA AVALON 112000
## posted222912 2015 Porsche Panamera 116100
## posted37310 2014 Audi RS 7 4.0T quattro 116491
## posted95215 1988 porsche 911 Carrera Targa TL 120000
## posted18546 2013 Mercedes-Benz G63 122950
## posted106214 2014 Land Rover Range 5.0L V8 123981
## posted212613 1941 willys 125000
## posted11066 2009 Lamborghini Gallardo 129950
## posted16934 2013 Isuzu NRR 129990
## posted191712 2015 Hyundai Sonata 138500
## posted112215 1976 Porsche 930 139000
## posted18506 2009 Mercedes-Benz SL65 139950
## posted163710 2015 Mercedes-Benz S-Class 143000
## posted231215 2015 Mercedes-Benz S-Class 143000
## posted214110 2011 Bentley Mulsanne 143950
## posted238613 2015 Porsche GT3 147000
## posted5038 2012 Mercedes-Benz SLS AMG 2dr Roadster SLS AMG 149890
## posted129214 2007 Lamborghini Gallardo Spyder 149995
## posted22621 2013 Ford 150000
## posted231114 2015 Porsche 911 152900
## posted220311 2011 toyota rav4 159000
## posted6747 2004 Lexus 470 169000
## posted698 2008 BMW X5 177588
## posted121313 2016 porsche 911 202455
## posted12630 2014 ferrari 458 italia 240000
## posted9976 2004 Toyota Corolla 286763
## posted1460 2010 CHEVROLET SILVERADO 359000
## posted23788 2006 FORD GT 400000
## posted21402 2009 CHEVROLET IMPALA 559500
## posted21422 2007 CHEVROLET MONTE CARLO 569500
## posted16005 2001 Honda Accord 9999999
## posted6903 2002 Caddy Seville sls 30002500
## posted22491 1969 Pontiac GTO 600030000
## posted23881 1969 Pontiac GTO 600030000
There are some very expensive cars in there like Mercedes-Benz G63 AMG, Bentley Mulsanne, Maserati 3500 GT, Porsche GT, etc. which are probably legitimately more than \$100,000. However, there are also cars in there such as “2015 Hyundai Sonata Call/SMS 650.445.0890 (12) - \$138500” which probably has an extra zero at the end and even a “2007 CHEVROLET MONTE CARLO LT - Easy Financing! Any Credit Auto Loans! BHPH! - \$569500” which probably has two extra zeros at the end. We will ignore them for now, but acknowledge that these are also problems and that they should be cleaned up.
# Let's look at the worst price errors
vposts[ vposts$price >= 9999999 & !is.na(vposts$price), c("header", "body")]
## header
## posted22491 1969 Pontiac GTO
## posted23881 1969 Pontiac GTO
## posted6903 2002 Caddy Seville sls
## posted16005 2001 Honda Accord
## body
## posted22491 \n We have 1968 & 1969 Pontiac GTO's.\nCurrently we are working on a 1968 end a 1969 Gto project is almost complete.\nOur Intention is the custom to specification by owner.\nCost will be between $6000 & $30,000. This will be depending on the car in the condition and the Owner financial capabilities. \nSerious inquires only inquiries only.. please call Tony at \n show contact info\n\n
## posted23881 \n We have 1968 & 1969 Pontiac GTO's.\nCurrently we are working on a 1968 end a 1969 Gto project is almost complete.\nOur Intention is the custom to specification by owner.\nCost will be between $6000 & $30,000. This will be depending on the car in the condition and the Owner financial capabilities. \nSerious inquires only inquiries only.. please call Tony at \n show contact info\n\n
## posted6903 \n clean, fully loaded, nice shine, good running engine and trans, willing to trade for old school or truck?????????????????? Mounted on 22 inch rims new tires no bends no cracks\n
## posted16005 \n Selling my car for some lunch money. $20 OBO. Comes with complimentary Oboe.\n
The two records with price of 600030000 are 1968 and 1969 Pontiac GTO - \$600030000, which can see from reading the body of the post that these are offers to customize GTO’s for somewhere between \$6,000 and \$30,000 and these values seem to be pasted together. We can revisit this in question #8, but for now, let’s exclude these.
Best guess for the Seville SLS with price 30002500 looks like a typo by the person who posted the ad. Perhaps they were originally thinking that they would sell it for \$3,000, but then changed their mind and lowered the price to \$2,500 but did not properly clear the field for price before they editing the posting price. We can revisit this in question #8, but for now, let’s exclude this.
The 2001 Honda Accord with price 9999999 looks like a deliberate misleading price by the person who posted the ad given that they failed to fill in several other fields.
As a side note, numbers like 99, or 999 or 99999999, etc. are often default entries for missing values, much like NA, #N/A, , Null, NULL, etc. Depending on the data, even numbers like -1 could be defined as missing data. We can revisit this in question #8, but for now, let’s exclude this.
For Question #3 here, all cars with a price greater than \$240,000 seem to have data errors so we will exclude all of them here.
We also note several other problems like lots of missing prices in general. These could be corrected one-by-one perhaps.
There are also lots of cars for sale at the price of \$1. This is a common advertising tactic to post for the minimum price since most people sort prices lowest-to-highest and thus these ads get seen at the top more often. Most of these are misleading ads by dealers, some are for car parts, some are offers for car financing. The same holds for almost all ads less than \$500. There is just too much data to clean up manually here so we will exclude them. SURELY we are throwing away some good data here and biasing our results.
# How many prices are NA?
sum( is.na(vposts$price) )
## [1] 3328
# How many cars for sale at $1?
sum( vposts$price == 1 & !is.na(vposts$price) )
## [1] 612
# Let's take a look 10% randomly selected $1 cars
idx = which( vposts$price == 1 & !is.na(vposts$price) )
idx = sample(x = idx, size = 60, replace = FALSE)
# Omitted here for brevity
# vposts[ idx, "body"]
# Only include prices between $500 and $240,000
idx = (vposts$price >= 500 & vposts$price <= 240000 & !is.na(vposts$price))
densityplot(vposts$price[ idx ], xlab = "Price", main = "Price")
# 94% of the data is between about $500 and $50,000
quantile(x = vposts$price, probs = c(0.05,0.99), na.rm = TRUE)
## 5% 99%
## 499 47000
# Final plot
avg = mean( vposts$price[ idx ] )
med = median( vposts$price[ idx ] )
dec = quantile(x = vposts$price[ idx ], probs = seq(from = 0.1, to = 0.9, by = 0.1) )
idx = (vposts$price >= 500 & vposts$price <= 50000 & !is.na(vposts$price))
plot(density(vposts$price[ idx ]), xlab = "Price", main = "Price", lwd = 2)
abline(h = 0)
points(x = c(avg, med), y = c(0,0), col = c("forestgreen", "darkorange"), pch = 18, cex = 3.0)
points(x = dec, y = rep(x = 0, times = length(dec)), col = "blue2", pch = 18, cex = 1.5)
legend("topright", legend = c("Mean", "Median", "Deciles"), col = c("forestgreen", "darkorange", "blue2"),
pch = 18)
# Different types. Note the NA's
names( table(vposts$type, useNA = "ifany") )
## [1] "bus" "convertible" "coupe" "hatchback" "mini-van"
## [6] "offroad" "other" "pickup" "sedan" "SUV"
## [11] "truck" "van" "wagon" NA
# Proportions table, sorted
sort( round( x = prop.table( x = table(vposts$type, useNA = "ifany") ), digits = 4) )
##
## bus offroad mini-van van wagon other
## 0.0006 0.0019 0.0131 0.0146 0.0161 0.0192
## convertible hatchback pickup truck coupe SUV
## 0.0204 0.0236 0.0262 0.0347 0.0469 0.1214
## sedan <NA>
## 0.2030 0.4583
# Let's plot the proportions
# Base dotchart() will not show the label NA and Lattice dotplot() will not plot the NA at all so force it
t = prop.table( x = table(vposts$type, useNA = "ifany") )
names(t)[ is.na(names(t)) ] = "NA"
# dotchart(x = sort(t), xlim = c(0,1), main = "Proportions of Car Types")
dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5, main = "Proportions of Car Types")
# Let's look at proportions *among* the non-missing
t = prop.table( x = table(vposts$type[ !is.na(vposts$type) ], useNA = "ifany") )
dotplot(x = sort(t), xlim = c(-0.05, 1.05), cex = 1.5, main = "Proportions of Car Types Among Non-Missing")
Close to half of the data is missing the vehicle type. Among those not missing the vehicle type, about 40% are sedans.
What we can see from the mosaic plots below overall and by transmission type and ignoring vehicle type other and fuel type other, is that gas vehicles dominate across vehicle types and across transmission types, with the notable exception that trucks have higher percent diesel than other vehicle types, as do buses with automatic transmissions - this is expected.
It might be easier to see these same relationships in dotplots than mosaicplots.
# Overall relationship fuel type and vehicle type
tbl = table(vposts$fuel, vposts$type)
row.order = order( rowSums( tbl ), decreasing = TRUE )
col.order = order( colSums( tbl ), decreasing = TRUE )
tbl = tbl[ row.order, col.order ]
col.palette = colorRampPalette(brewer.pal(9,"Blues"))(length(col.order))
mosaicplot(tbl, las = 2, color = col.palette,
main = "Overall Fuel Type by Vehicle Type", cex = 0.7
)
# Split by transmission type
fuelVehicleBytransmission = split( vposts[ , c("fuel", "type")], vposts$transmission)
invisible(
lapply( 1:length(fuelVehicleBytransmission),
FUN = function(x){
tbl = table(fuelVehicleBytransmission[[x]]$fuel, fuelVehicleBytransmission[[x]]$type)
row.order = order( rowSums( tbl ), decreasing = TRUE )
col.order = order( colSums( tbl ), decreasing = TRUE )
tbl = tbl[ row.order, col.order ]
# Get color palette and then reverse the order darkest to lightest
col.palette = colorRampPalette(brewer.pal(9,"Blues"))(length(col.order))
col.palette = col.palette[ length(col.palette):1 ]
mosaicplot(tbl, las = 2, color = col.palette,
main = paste0("Fuel and Vehicle by Transmission = ",
names(fuelVehicleBytransmission)[x]), cex = 0.7
)
}
)
)
dotplot(
prop.table( table(vposts$type , vposts$transmission, vposts$fuel, useNA = "ifany"),
margin = c(1,2) ),
xlim = c(-0.05,1.05), auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16),
xlab = "Percent"
)
# levels() gives us the cities
levels(vposts$city)
## [1] "boston" "chicago" "denver" "lasvegas" "nyc" "sac"
## [7] "sfbay"
# length() tells us how many there are
length( levels(vposts$city) )
## [1] 7
It is a bit suspicious that all cities have roughly 5000 observations and that the percentages are almost perfectly 50/50 within each city. Smells like the data was filtered first…
Note that we also created a new variable called ownerDealer in the setup section.
# Each city has ~5000 observations
table(vposts$city)
##
## boston chicago denver lasvegas nyc sac sfbay
## 4958 4886 4979 4963 4983 4966 4942
# prop.table() with margin = 2 gives the column percentages
prop.table(table(vposts$ownerDealer, vposts$city), margin = 2)
##
## boston chicago denver lasvegas nyc sac
## Owner 0.4975797 0.4901760 0.4994979 0.4984888 0.4992976 0.5000000
## Dealer 0.5024203 0.5098240 0.5005021 0.5015112 0.5007024 0.5000000
##
## sfbay
## Owner 0.4991906
## Dealer 0.5008094
# Plot the counts from table() (Note: almost identical graphs as percentages)
plot( table(vposts$city, vposts$ownerDealer, useNA = "ifany"), main = "Owner/Dealer by City",
xlab = "City", ylab = "Owner or Dealer", col=c("darkblue","darkviolet"))
# Plot the percentages from prop.table (Note the switch of the variables)
plot( prop.table(table(vposts$city, vposts$ownerDealer), margin = 2), main = "Owner/Dealer by City",
xlab = "City", ylab = "Owner or Dealer", col=c("darkblue","darkviolet"))
# Use a dotplot since there are only two categories
tbl = prop.table( table(vposts$ownerDealer, vposts$city, useNA = "ifany"), margin = 2)
dotplot( sort(tbl[ 1, ]), xlim = c(-0.05, 1.05), xlab = "% Owner for Sale by Owner", ylab = "City",
main = "Percent for Sale by Owner by City", cex = 1.5)
The barplots basically capture the key information, but notice that the bar widths do not really represent anything meaningful. Since we are interested in the percentage for sale by owner within each city, a dotplot probably captures this the best since there are only two categories (byOwner = TRUE = owner, and byOwner = FALSE = dealer) and not missing data.
We can see from the tables and very clearly from the plots that the percentages of owner posting and dealer postings are almost perfectly 50/50 and this does not seem to vary across the different cities at all.
# Being pedantic, let's get the max price
max(vposts$price, na.rm = TRUE)
## [1] 600030000
# Let's remind ourselves of the few worst prices
idx = which( vposts$price >= 9999999 & !is.na(vposts$price) )
# Let's look at the actual data for those cars > $100,000
vposts[ idx, c("header", "price", "maker", "year") ]
## header price maker year
## posted22491 1969 Pontiac GTO 600030000 pontiac 1969
## posted23881 1969 Pontiac GTO 600030000 pontiac 1969
## posted6903 2002 Caddy Seville sls 30002500 cadillac 2002
## posted16005 2001 Honda Accord 9999999 honda 2001
# Let's fix the two records for the 1969 Pontiac GTO with price = 600030000
# We could remove it completely - maybe not the best approach.
# We could average the prices from the description, $6000 and $30,000 = $18,000 - not bad approach.
# We could research this car online and try to find some average price - pretty good approach.
# Let's see if we can't find a decent point estimate from within the dataset itself:
idx = ( vposts$maker == "pontiac" & vposts$year %in% c(1968, 1969) &
vposts$price < 9999999 & vposts$price > 1 &
grepl(pattern = "GTO", x = vposts$header, ignore.case = TRUE) &
!is.na(vposts$maker) & !is.na(vposts$price) & !is.na(vposts$header) )
dat = vposts[ idx, c("header", "price", "maker", "year") ]
dat[ order(dat$price), ]
## header price maker year
## posted4991 1968 pontiac gto 15995 pontiac 1968
## posted5371 1968 pontiac gto 15995 pontiac 1968
## posted231214 1968 Pontiac GTO 24500 pontiac 1968
## posted16497 1969 Pontiac GTO 25000 pontiac 1969
## posted201111 1969 Pontiac GTO 25000 pontiac 1969
## posted7355 1968 pontiac gto 30000 pontiac 1968
## posted16701 1968 Pontiac gto 38500 pontiac 1968
## posted40911 1968 GTO 38500 pontiac 1968
# Let's use a rounded mean price
newPrice = round( mean(vposts$price[idx]), digits = -3)
vposts$price[vposts$price == 600030000 & !is.na(vposts$price)] = newPrice
# Now let's fix the 2002 Caddy Seville sls with price = 30002500
# Let's see if we can't find a decent point estimate from within the dataset itself:
idx = ( vposts$maker == "cadillac" & vposts$year %in% c(2002) &
vposts$price < 9999999 & vposts$price > 1 &
grepl(pattern = "Seville", x = vposts$header, ignore.case = TRUE) &
!is.na(vposts$maker) & !is.na(vposts$price) & !is.na(vposts$header) )
dat = vposts[ idx, c("header", "price", "maker", "year") ]
dat[ order(dat$price), ]
## header price maker year
## posted20963 2002 cadillac seville 700 cadillac 2002
## posted20691 2002 Cadillac Seville 1500 cadillac 2002
## posted9323 2002 cadillac seville 2100 cadillac 2002
## posted16927 2002 cadillac seville 3495 cadillac 2002
# Average price estimate is lower than the posting price of $2500 or $3000, so use the lower $2500
round( mean(vposts$price[idx]), digits = -3)
## [1] 2000
vposts$price[vposts$price == 30002500 & !is.na(vposts$price)] = 2500
# The 2001 Honda Accord with price = 9999999 is a junk post
# "Selling my car for some lunch money. $20 OBO. Comes with complimentary Oboe."
# Remove it completely.
idx = which(vposts$price == 9999999 & !is.na(vposts$price))
vposts = vposts[ -idx, ]
Again, SURELY there are more that need to be fixed, but that was enough pain for one day.
cities = levels(vposts$city)
# We could do the inner function in one line, but it is hard to read so break it down into steps.
# names( head( sort( table(vposts$maker[ vposts$city == x & vposts$byOwner == y & !is.na(vposts$city) ]),
# decreasing = TRUE), 3) )
makeByCityByOwner = lapply(X = c(TRUE, FALSE), FUN = function(y){
sapply(X = cities, FUN = function(x){
t = table(vposts$maker[ vposts$city == x & vposts$byOwner == y & !is.na(vposts$city) ])
s = sort(t, decreasing = TRUE)
h = head(s, 3)
n = names(h)
})
})
names(makeByCityByOwner) = c("Owner", "Dealer")
makeByCityByOwner
## $Owner
## boston chicago denver lasvegas nyc sac
## [1,] "ford" "chevrolet" "ford" "ford" "nissan" "toyota"
## [2,] "honda" "ford" "chevrolet" "chevrolet" "toyota" "ford"
## [3,] "chevrolet" "honda" "toyota" "toyota" "honda" "chevrolet"
## sfbay
## [1,] "toyota"
## [2,] "honda"
## [3,] "ford"
##
## $Dealer
## boston chicago denver lasvegas nyc sac
## [1,] "ford" "chevrolet" "ford" "ford" "nissan" "ford"
## [2,] "toyota" "ford" "chevrolet" "nissan" "toyota" "toyota"
## [3,] "chevrolet" "nissan" "dodge" "chevrolet" "honda" "chevrolet"
## sfbay
## [1,] "toyota"
## [2,] "ford"
## [3,] "bmw"
# Check if the top in each city by owner and by dealer match
makeByCityByOwner$Owner[ 1, ] == makeByCityByOwner$Dealer[ 1, ]
## boston chicago denver lasvegas nyc sac sfbay
## TRUE TRUE TRUE TRUE TRUE FALSE TRUE
A quick visual scan shows that 2 of the top 3 makes by city for sale by owner are among the top 3 for sale by dealer within the same city. The top in each city by owner is the same as by dealer in all cities except SacTown. They are fairly similar.
There are a few clear data errors with years in c(4, 1900, 2022), potentially others as well. We could fix them, or just ignore them since there are only a few data points that will not affect our analysis much.
The 2022 Honda Odyssey “has only 117102 miles” so it is probably a typo for 2002, so let’s fix it that way.
The year = 1900 ads are for wheels and buying back cars that fail the smog test so they are not cars at all. Remove them.
The Jeep with year = 4 is probably 2004 since it has a “AM/FM cassette player-muli CD player”, so let’s fix it that way.
# Omitted here for brevity
# vposts[ vposts$year %in% c(4, 1900, 2022) & !is.na(vposts$year), ]
# Clean up
vposts$year[ vposts$year == 2022 & !is.na(vposts$year) ] = 2002
vposts = vposts[ -which(vposts$year == 1900 & !is.na(vposts$year)), ]
vposts$year[ vposts$year == 4 & !is.na(vposts$year) ] = 2004
# Create an age variable
vposts$age = 2016 - vposts$year
# Overall by owner has older cars, dealers have newer cars
histogram( ~ age | byOwner, data = vposts, main = "Age by Owner/Dealer", col = "lightblue")
# Subset to zoom in a bit on the plots
idx = ( vposts$age < 25 & !is.na(vposts$age))
# Zoomed in
histogram( ~ age | byOwner, data = vposts, subset = idx, main = "Age by Owner/Dealer Zoomed",
col = "lightblue")
# Looking by city, there is not much difference at all across cities for owner vs. dealer
histogram( ~ age | byOwner + city, data = vposts, subset = idx, main = "Age by City by Owner/Dealer Zoomed",
col = "lightblue")
It does seem that cars for sale by owners tend to be older than cars for sale by dealers. However, this does not seem to vary much by city.
Remember that the data was limited to just 7 major cities. Thus it would be a mistake to conclude something like almost all used cars on Craigslist (or from whatever website this data comes) sell in these 7 cities. We can conclude that the location of the people (and/or the cars themselves) who sell used cars in these major cities tend to be clustered fairly tightly around the major cities. Most buyers may be unlikely to travel far to look at, let alone, buy a used car.
There could be several explanations for the points far from one of the major cities. They could be travelling when they actaully posted the ad, for example. But in general, the location of the person posting the car is generally the same as the city in which they are attempting to sell the vehicle.
We also notice in the second plot of California that it seems that many people in Sac Town post their car on SF Bay Area, which we can see by the SF Bay points bleeding into the Sacramento area on the map. Perhaps they are trying to reach a bigger audience in the Bay.
We also notice that there are a few people in Reno / Tahoe posting ads in Sacramento and SF Bay. And there also appears to be somebody posting off the coast of Monterrey, but perhaps that is a data error :)
We could make one more improvement to this plot by using an alpha parameter to control the transparancy of the plotting points to better see density and bleeding into other regions.
# Split the data by city to see if location of poster is clustered around posting city
locationByCity = split(vposts[ , c("long", "lat")], vposts$city)
# Get color palette and then reverse the order darkest to lightest
col.palette = brewer.pal( length(locationByCity), "Purples")
col.palette = col.palette[ length(col.palette):1 ]
map('state', mar = c(0,0,0,0))
invisible(
lapply( 1:length(locationByCity),
FUN = function(x){
points(x = locationByCity[[ x ]]$long, y = locationByCity[[ x ]]$lat,
pch = ".", cex = 5, col = col.palette[x] )
}
)
)
legend("bottomleft", legend = names(locationByCity), col = col.palette, pch = 15, cex = 0.9)
map('state', region = 'california', mar = c(0,0,0,0))
points(x = locationByCity[[ "sac" ]]$long, y = locationByCity[[ "sac" ]]$lat,
pch = ".", cex = 4, col = col.palette[1] )
points(x = locationByCity[[ "sfbay" ]]$long, y = locationByCity[[ "sfbay" ]]$lat,
pch = ".", cex = 4, col = col.palette[5] )
# reset margins
par(mar = c(5, 4, 4, 2) + 0.1 )
Notice that in the dotplot below, that the distributions in the different panels are all almost identical excpet that the distribtion shows some variation in the middle column where fuel type = "gas"
. Thus, we can essentially drop the fuel type from the plot, subset to just fuel type = "gas"
and consider the relationship among the remaining three variables.
# 9 panels; circle colors are the vehicle type
# dotplot( table(vposts$fuel, vposts$drive, vposts$transmission, vposts$type))
# 15 panels
dotplot( table(vposts$type, vposts$fuel, vposts$drive, vposts$transmission),
auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16) )
# fuel = "gas" for almost all the data
table(vposts$fuel, useNA = "ifany")
##
## diesel electric gas hybrid other <NA>
## 1071 68 29222 350 1187 2771
Subsetting to just fuel == "gas"
we see that automatic transmissions are the most common across all types of cars and across drive trains, followed by manual then other. Within the rear wheel drive vehicles, manual stick shifts do have a higher percentage than other verhicle types for hatchback, coupe and convertible, which makes sense since coupe and convertibles tend to be sports cars and hatchbacks tend to be cheaper, entry level models. Among 4 wheel ddrive, offroad vehicles have a higher
idx = (vposts$fuel == "gas" & !is.na(vposts$fuel))
dotplot(
prop.table( table(vposts$type[idx], vposts$drive[idx], vposts$transmission[idx], useNA = "ifany"),
margin = c(1,2) ),
xlim = c(-0.05,1.05), auto.key = list(columns = 3), par.settings = simpleTheme(cex=1.2, pch=16),
xlab = "Percent"
)
Previously, we fixed some prices and we also concluded that we would subset price to those prices >= 500 and <= 50,000, which captures about 94% of the data.
We should spend some time cleaning up the odometer readings. For example, the max odometer reading of 1234567890 is just some annoying poster. But to keep things simple here, we see that 99th percentile of odometer readings is 2.610^{5}, thus we will trim the data at 500,000 to capture almost all of the distribtion.
As we can see from the smoothScatter plot below, there generally does seem to be an upward trend for the overwhelming majority of the data. However, notice the dense shading between about 5 years of age and 20 years of age that have low odometer readings. Also notice that once age of the vehicle gets beyond about 25 years or so that there is actually a downward trend in that subset of the data. This is likely because very old antique cars just sit in some old man’s garage getting polished and buffed and only taken out for a ride during antique car shows.
quantile(vposts$odometer, probs = 0.99, na.rm = TRUE)
## 99%
## 260000
idx = ( vposts$odometer < 500000 & !is.na(vposts$odometer) & !is.na(vposts$age) )
smoothScatter(x = vposts$age[idx], y = vposts$odometer[idx], xlab = "Age of Vehicle",
ylab = "Odometer Reading", main = "Odometer Readings by Age of Vehicle" )
As we can see in the smoothscatter plot below, there is a general negative relationship between odometer reading and price, but note that there are some very expensive cars with low odometer readings, many of those are antique cars.
idx = ( vposts$odometer < 500000 & vposts$price >= 500 & vposts$price <= 100000 &
!is.na(vposts$odometer) & !is.na(vposts$price) )
smoothScatter(x = vposts$odometer[idx], y = vposts$price[idx], xlab = "Odometer Reading",
ylab = "Price of Vehicle", main = "Price by Odometer Readings of Vehicle" )
From the first smoothScatter plot below, it would seem that cars older than say 35 years are “old.”
From the table below, Chevy and Ford make up more than 50% of the “old” cars and contains most of the classic American car brands, as one might expect. In particular, there are not many “old” Japanese cars since the U.S. did not start importing Japanese cars on a large scale until the oil shocks of the 1970’s, whereas our cutoff for “old” is approximately 1970.
Comparing the price distribution of “old” cars compared to all cars, “old” cars seem have more density at higher prices with the majority of the prices less than \$40,000 whereas the bulk of the overall data tends to be less than \$20,000.
# Check overall age and price relationship
idx = (vposts$price >= 500 & vposts$price <= 100000 &
!is.na(vposts$price) & !is.na(vposts$age) )
smoothScatter(x = vposts$age[idx], y = vposts$price[idx], xlab = "Age of Vehicle",
ylab = "Price of Vehicle", main = "Price by Age of Vehicle" )
# Look at makers and price distribution for "old" cars
idx = (vposts$age >= 35 & !is.na(vposts$age))
# Look at the 10 largest makers of "old" cars
round( sort( tail( sort(table(vposts$maker[ idx ])) / sum(table(vposts$maker[ idx ])), 10)
, decreasing = TRUE), digits = 3 )
##
## chevrolet ford volkswagen dodge pontiac cadillac
## 0.315 0.213 0.055 0.038 0.038 0.037
## buick gmc plymouth oldsmobile
## 0.027 0.026 0.025 0.024
idx = (vposts$age >= 35 & !is.na(vposts$age) & !is.na(vposts$price))
smoothScatter(x = vposts$age[idx], y = vposts$price[idx], xlab = "Age of Vehicle",
ylab = "Vehicle Price", main = "Price by Age of Vehicle" )
When searching for cars on websites, the most important filters are usually year, make and model, in that order. Notice that year and make (i.e. maker) are stand-alone variables in the dataset, but model is not. However, notice that variable called header in the dataset is year, make, then model. Thus if we could parse the text string of each header to pull out the model, we could derive our own stand-alone variable for model. In fact, a future homework assignment will focus on using regular expressions to do just this type of string parsing.
head(vposts$header, 20)
## [1] "2012 Chevrolet Camaro SS"
## [2] "2013 Chevrolet Equinox LT"
## [3] "2013 Nissan Altima 2.5 SV"
## [4] "2009 Infiniti M35x X"
## [5] "2013 Infiniti G37x X"
## [6] "2012 Acura MDX 3.7L"
## [7] "2010 Toyota Yaris"
## [8] "2012 Acura RDX Base"
## [9] "2014 Chevrolet Cruze 1LT Auto"
## [10] "2009 Lexus IS 250"
## [11] "2013 Toyota Camry SE"
## [12] "2014 Honda CR-Z"
## [13] "2008 BMW 335 xi"
## [14] "2013 Nissan Juke SL"
## [15] "2013 Dodge Durango SXT"
## [16] "2012 Toyota Camry 4dr Sedan I4 Automatic SE"
## [17] "2012 Toyota Venza LE"
## [18] "2014 Nissan Juke S"
## [19] "2012 Toyota Corolla"
## [20] "2012 Honda Pilot EX-L"
# Print out conditions so we can cut and paste them into smaller
# categories. There's really no way out of this since a human has to decide
# what the new categories should be.
conditions = levels(vposts$condition)
conditions = sprintf('"%s",\n', conditions)
cat(conditions)
## "0used",
## "207,400",
## "ac/heater",
## "carfax guarantee!!",
## "certified",
## "complete parts car, blown engine",
## "excellent",
## "fair",
## "front side damage",
## "good",
## "hit and run :( gently",
## "honnda",
## "like new",
## "mint",
## "muscle car restore",
## "needs bodywork",
## "needs restoration!",
## "needs restored",
## "needs total restore",
## "needs work",
## "needs work/for parts",
## "new",
## "nice",
## "nice rolling restoration",
## "nice teuck",
## "not running",
## "parts",
## "pre owned",
## "pre-owned",
## "preowned",
## "preownes",
## "project",
## "project car",
## "rebuildable project",
## "restoration",
## "restoration project",
## "restore",
## "restored",
## "rough but runs",
## "salvage",
## "superb original",
## "used",
## "very good",
# We'll base the new categories on the most common existing categories.
sort(table(vposts$condition))
##
## 0used 207,400
## 1 1
## ac/heater complete parts car, blown engine
## 1 1
## front side damage hit and run :( gently
## 1 1
## honnda mint
## 1 1
## muscle car restore needs restoration!
## 1 1
## needs total restore needs work
## 1 1
## needs work/for parts nice
## 1 1
## nice rolling restoration nice teuck
## 1 1
## not running pre-owned
## 1 1
## preownes project car
## 1 1
## rebuildable project restoration
## 1 1
## restoration project restore
## 1 1
## restored rough but runs
## 1 1
## carfax guarantee!! needs restored
## 2 2
## pre owned superb original
## 2 2
## needs bodywork project
## 3 3
## parts preowned
## 5 10
## very good certified
## 10 54
## salvage new
## 61 273
## fair used
## 770 1180
## like new good
## 2898 4663
## excellent
## 7543
# We'll base the new categories on the most common existing categories.
sort(table(vposts$condition))
##
## 0used 207,400
## 1 1
## ac/heater complete parts car, blown engine
## 1 1
## front side damage hit and run :( gently
## 1 1
## honnda mint
## 1 1
## muscle car restore needs restoration!
## 1 1
## needs total restore needs work
## 1 1
## needs work/for parts nice
## 1 1
## nice rolling restoration nice teuck
## 1 1
## not running pre-owned
## 1 1
## preownes project car
## 1 1
## rebuildable project restoration
## 1 1
## restoration project restore
## 1 1
## restored rough but runs
## 1 1
## carfax guarantee!! needs restored
## 2 2
## pre owned superb original
## 2 2
## needs bodywork project
## 3 3
## parts preowned
## 5 10
## very good certified
## 10 54
## salvage new
## 61 273
## fair used
## 770 1180
## like new good
## 2898 4663
## excellent
## 7543
# Define new categories.
new_cats = list(
excellent = c("excellent"),
good = c("good", "very good"),
"like new" = c("like new", "mint", "new", "pre owned", "pre-owned", "preowned", "preownes"),
used = c("0used", "used"),
fair = c("fair", "nice", "nice teuck"),
salvage = c("complete parts car, blown engine", "front side damage", "hit and run :( gently",
"muscle car restore", "needs bodywork", "needs restoration!", "needs restored",
"needs total restore", "needs work", "needs work/for parts", "nice rolling restoration",
"not running", "parts", "project", "project car", "rebuildable project", "restoration",
"restoration project", "restore", "restored", "salvage", "rough but runs"),
other = c("207,400", "ac/heater", "carfax guarantee!!", "certified", "honnda", "superb original" )
)
# Convert conditions to new categories.
vposts$new_cond = vposts$condition
levels(vposts$new_cond) = c(levels(vposts$new_cond), "other")
for (i in seq_along(new_cats)) {
new_cat = names(new_cats)[[i]]
vposts$new_cond[vposts$new_cond %in% new_cats[[i]]] = new_cat
}
vposts$new_cond = factor(vposts$new_cond)
# Restrict to odometer less than 5 million.
sane_odo = subset(vposts, odometer < 5e6)
odo_by_cond = split(sane_odo$odometer, sane_odo$new_cond)
boxplot(odo_by_cond, col = "salmon")
title("Odometer Distribution by Condition", ylab = "Miles")
# Make a second plot to show the distribution better.
boxplot(odo_by_cond, col = "skyblue1", ylim = c(0, 5e5))
title("Odometer Distribution by Condition", ylab = "Miles")
# Now we can see that the highest odometer readings seem to be for the "fair"
# and "good" conditions, which is a little surprising. It's possible people
# overstate the condition of the car when the odometer is higher, to try to
# make it sound more appealing. The condition with the most spread-out
# distribution is "salvage," which makes sense because salvage cars could be
# very old, or they could be new cars that were damaged somehow.
sane_price = subset(vposts, price < 2e5)
price_by_cond = split(sane_price$price, sane_price$new_cond)
boxplot(price_by_cond, col = "thistle")
title("Price Distribution by Condition", ylab = "Dollars")
# Make an age column.
vposts$age = 2015 - vposts$year
sane_age = subset(vposts, age < 200)
age_by_cond = split(sane_age$age, sane_age$new_cond)
boxplot(age_by_cond, col = "aquamarine")
title("Age Distribution by Condition", ylab = "Years")
# The price and age distributions don't show anything too surprising. Price and
# condition appear to be directly related, with salvage cars having the lowest
# prices. Cars that are "like new" or "excellent" are sometimes offered for
# extremely high prices, whereas this is less common with cars in worse
# condition. Age and condition, are inversely related: older cars seem to have
# worse conditions. As with odometer and condition, salvage cars exhibit the
# weakest relationship.