Haste makes waste

Uda-DataAnalysis-29--钻石与价格预测

Posted on By lijun

3. 练习:散点图回顾

# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).

# Limit the x-axis and y-axis to omit the top 1% of values.
ggplot(aes(x=diamonds$carat,y=diamonds$price),data=diamonds) + 
  geom_point(fill=I("#F79420"),color=I("blue"),shape=21) + 
  xlim(0,quantile(diamonds$carat,0.99)) + 
  ylim(0,quantile(diamonds$price,0.99))

image

4. 练习:价格与克拉的关系

  • 上图的基础上添加了一个平滑曲线,但是如果用这个曲线去做预测的话,会丢失很多信息。
ggplot(aes(x=diamonds$carat,y=diamonds$price),data=diamonds) + 
  geom_point(fill=I("#F79420"),color=I("blue"),shape=21) + 
  xlim(0,quantile(diamonds$carat,0.99)) + 
  ylim(0,quantile(diamonds$price,0.99)) +
  stat_smooth(method = "lm")

image

7. 练习: Ggpairs 函数

#install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape2')
#install.packages('dplyr')

# load the ggplot graphics package and the others
library("ggplot2")
library("GGally")
library("scales")
library("memisc")

# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp)
ggsave("diamond_samp")

运行上述代码时出错了,使用如下的方式解决了:

remove.packages(c("ggplot2", "data.table"))
install.packages('Rcpp', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
install.packages('data.table', dependencies = TRUE)

最终图形如下,可以看到price与x/y/z有很强的相关性:

image

8. 练习: 对钻石的需求

# Create two histograms of the price variable
# and place them side by side on one output image.

# We’ve put some code below to get you started.

# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.

# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
  • 按照视频中的代码,没办法将两个plot排列起来,只能分开了,这里使用的是qplot描绘
qplot(data=diamonds,x=price,binwidth=100,fill=I("#099DD9")) + 
  ggtitle('Price')

image

qplot(data=diamonds,x=price,binwidth=0.01,fill=I("#F79420")) + 
  ggtitle('Price (log10)') + scale_x_log10()

image

  • 使用ggplot描画:
p1 <- ggplot(aes(x = diamonds$price), data = diamonds) + 
  geom_histogram(binwidth = 100,fill=I("#099DD9"),color=I("blue")) +
  ggtitle('Price') 
  
p2 <- ggplot(aes(x = log10(diamonds$price)), data = diamonds) +
  geom_histogram(binwidth = 0.01,fill=I("#F79420"),color=I("blue")) +
  ggtitle('Price (log10)') 

library(gridExtra)
grid.arrange(p1,p2,ncol=1)

image

9. 将需求与价格分布联系起来

qplot(data=diamonds,x=price,binwidth=0.05,fill=I("#F79420")) + 
  ggtitle('Price (log10)') + scale_x_log10() + facet_wrap(~cut)
  • 下图可以看出,价格较低,但是成色最好的钻石,有最好的销量。

image

10. 散点图转换

  • x轴是carat,y轴是price的对数(使用视频中的trans=log10_trans()出错了):
qplot(carat,price,data=diamonds) + ggtitle("Price log10 by carat") +scale_y_log10()

image

  • 没有使用自定义函数,下图跟视频要求不一致:
ggplot(aes(x = carat,price),data=diamonds) + 
  geom_point() + 
  scale_x_continuous(limits = c(0.2,3),
                     breaks = c(0.2,0.5,1,2,3)) +
  scale_y_continuous(limits = c(350,15000),
                     breaks = c(350,1000,5000,10000,15000)) +
  scale_y_log10() +
  ggtitle("Price (log10) by Cube-Root of Carat")

image

11. 练习: 复习过度绘制

观察上面一个图形,由于所有的点都堆积在一起,无法看出分布规律,调整一下绘图方式:

ggplot(aes(x = carat,price),data=diamonds) + 
  geom_point(alpha=0.1,size=0.75,position ="jitter") + 
  scale_x_continuous(limits = c(0.2,3),
                     breaks = c(0.2,0.5,1,2,3)) +
  scale_y_continuous(limits = c(350,15000),
                     breaks = c(350,1000,5000,10000,15000)) +
  scale_y_log10() +
  ggtitle("Price (log10) by Cube-Root of Carat")

image

通过上面的图就能看到关键区域和疏密程度了。

13. 练习: 价格与克拉和净度

上面考察了克拉与价格的关系,价格除了与克拉有关系外,与净度也有关系,下面使用净度来上色:

ggplot(aes(x = carat,price),data=diamonds) + 
  geom_point(alpha=0.5,size=0.75,position ="jitter",aes(color=diamonds$clarity)) + 
  scale_color_brewer(type="div",
                     guide = guide_legend(title="Clarity",reverse = TRUE,
                                          override.aes = list(alpha=1,size=2))) +
  scale_x_continuous(limits = c(0.2,3),
                     breaks = c(0.2,0.5,1,2,3)) +
  scale_y_continuous(limits = c(350,15000),
                     breaks = c(350,1000,5000,10000,15000)) +
  scale_y_log10() +
  ggtitle("Price (log10) by Cube-Root of Carat") + 
  theme(legend.position="right") 

image

15. 价格与克拉和切工

  • 与上面的绘制方式完全一致,只是将净度控制颜色换成了cut即切工控制颜色。
ggplot(aes(x = carat,price),data=diamonds) + 
  geom_point(alpha=0.5,size=0.75,position ="jitter",aes(color=diamonds$cut)) + 
  scale_color_brewer(type="div",
                     guide = guide_legend(title="Clarity",reverse = TRUE,
                                          override.aes = list(alpha=1,size=2))) +
  scale_x_continuous(limits = c(0.2,3),
                     breaks = c(0.2,0.5,1,2,3)) +
  scale_y_continuous(limits = c(350,15000),
                     breaks = c(350,1000,5000,10000,15000)) +
  scale_y_log10() +
  ggtitle("Price (log10) by Cube-Root of Cut") + 
  theme(legend.position="right") 

image

17. 价格与克拉和颜色

ggplot(aes(x = carat,price),data=diamonds) + 
  geom_point(alpha=0.5,size=0.75,position ="jitter",aes(color=diamonds$color)) + 
  scale_color_brewer(type="div",
                     guide = guide_legend(title="Clarity",reverse = TRUE,
                                          override.aes = list(alpha=1,size=2))) +
  scale_x_continuous(limits = c(0.2,3),
                     breaks = c(0.2,0.5,1,2,3)) +
  scale_y_continuous(limits = c(350,15000),
                     breaks = c(350,1000,5000,10000,15000)) +
  scale_y_log10() +
  ggtitle("Price (log10) by Cube-Root of color") + 
  theme(legend.position="right") 

image

20. 构建线性模型

参照資料:R 中的线性模型和运算符

m1 <- lm(I(log(price)) ~ I(carat^(1/3)),data = diamonds)
m2 <- update(m1,~ . + carat)
m3 <- update(m2,~ . + cut)
m4 <- update(m3,~ . + color)
m5 <- update(m4,~ . + clarity)

mtable(m1,m2,m3,m4,m5)
  • 输出:
Calls:
m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
    data = diamonds)
m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
    clarity, data = diamonds)

image

最终构建的模型如下:

image

最后一个参数是误差。

22. 练习: 更大、更好的数据集

从 https://github.com/solomonm/diamonds-data 下载 ,然后加载load("BigDiamonds.rda")。 过滤了1万美元以上的,以及限定鉴定机构为GIA。

diamondsbig$logprice = log(diamondsbig$price)


m1 <- lm(logprice ~ I(carat^(1/3)),
  data=diamondsbig[diamondsbig$price<10000 & 
                     diamondsbig$cert == "GIA",]
)
m2 <- update(m1,~ . + carat)
m3 <- update(m2,~ . + cut)
m4 <- update(m3,~ . + color)
m5 <- update(m4,~ . + clarity)

suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
mtable(m1, m2, m3, m4, m5)

结果输出略

23. 预测

thisDiamond = data.frame(carat = 1.00,cut="V.Good",
                         color="I",clarity="VS1")

modelEstimate = predict(m5,newdata = thisDiamond,
                        interval = "prediction",level=.95)

exp(modelEstimate)
  • 输出:

fit 1 5040.436 lwr 3730.34 upr 6810.638