Haste makes waste

Uda-DataAnalysis-28--Problem Set: Explore Many Variables

Posted on By lijun

1. Exercise: Price Histograms with Facet and Color

  • Requirements:
# Create a histogram of diamond prices.
# Facet the histogram by diamond color
# and use cut to color the histogram bars.

# The plot should look something like this.
# http://i.imgur.com/b5xyrOu.jpg

# Note: In the link, a color palette of type
# 'qual' was used to color the histogram using
# scale_fill_brewer(type = 'qual')
  • Code and plot: facet by color (one panel per color value) and use cut to fill the bars within each histogram:
ggplot(aes(x=price),data=diamonds) + 
	geom_histogram(aes(fill=cut)) + 
	facet_wrap(~color,ncol = 2) + 
	scale_fill_brewer(type = 'qual')
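As a numeric cross-check of the faceted histogram, the counts behind each panel can be tabulated directly. This is only a sketch: it assumes the `diamonds` data set that ships with ggplot2, and the name `counts` is made up for illustration.

```r
library(ggplot2)  # loaded here only for the diamonds data set

# Rows are diamond colors (one per facet), columns are cuts (one per fill color)
counts <- table(color = diamonds$color, cut = diamonds$cut)
print(counts)
```

Each row total should match the overall height of the corresponding panel once its bars are summed.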


2. Exercise: Price vs. Table Colored by Cut

# Create a scatterplot of diamond price vs.
# table and color the points by the cut of
# the diamond.

# The plot should look something like this.
# http://i.imgur.com/rQF9jQr.jpg

# Note: In the link, a color palette of type
# 'qual' was used to color the scatterplot using
# scale_color_brewer(type = 'qual')
  • Scatterplot. scale_color_brewer(type = 'qual') selects the type of color palette used for the points; see ?scale_color_brewer for help.
ggplot(aes(x=table,y=price),data=diamonds) + geom_point(aes(color=cut)) + 
  scale_color_brewer(type = 'qual')
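Besides `type`, `scale_color_brewer` also accepts a `palette` argument that picks a specific ColorBrewer palette by name or by index. The sketch below is one possible variation, not the course's reference solution; the choice of 'Set1' and the `alpha` value are my own assumptions.

```r
library(ggplot2)

# Same scatterplot, with an explicitly named qualitative palette
# and some transparency to reduce overplotting
p <- ggplot(aes(x = table, y = price), data = diamonds) +
  geom_point(aes(color = cut), alpha = 0.5) +
  scale_color_brewer(type = 'qual', palette = 'Set1')

print(p)
```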


table means: width of top of diamond relative to widest point (43–95)

3. Exercise: Typical Table Value

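What is the typical table range for the majority of diamonds of ideal cut? What is the typical table range for the majority of diamonds of premium cut? Read the answer off the plot created in the previous exercise. No summaries are needed.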
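The exercise only asks for a visual estimate, but the reading can be sanity-checked numerically. A minimal sketch, assuming the `diamonds` data set from ggplot2 and taking the interquartile range as a stand-in for "the majority" (that choice is mine, not the course's):

```r
library(ggplot2)  # for the diamonds data set

# Keep only the two cuts the question asks about and drop unused factor levels
top_cuts <- droplevels(subset(diamonds, cut %in% c("Ideal", "Premium")))

# 25th and 75th percentiles of table for each cut
table_ranges <- tapply(top_cuts$table, top_cuts$cut, quantile,
                       probs = c(0.25, 0.75))
print(table_ranges)
```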


4. Exercise: Price vs. Volume and Diamond Clarity

# Create a scatterplot of diamond price vs.
# volume (x * y * z) and color the points by
# the clarity of diamonds. Use scale on the y-axis
# to take the log10 of price. You should also
# omit the top 1% of diamond volumes from the plot.

# Note: Volume is a very rough approximation of
# a diamond's actual volume.

# The plot should look something like this.
# http://i.imgur.com/excUpea.jpg

# Note: In the link, a color palette of type
# 'div' was used to color the scatterplot using
# scale_color_brewer(type = 'div')
diamonds$volume <- diamonds$x * diamonds$y * diamonds$z

ggplot(aes(x=volume,y=price),data=diamonds) + 
  geom_point(aes(color=clarity)) + 
  scale_y_log10() + 
  xlim(0,quantile(diamonds$volume,0.99)) + 
  scale_color_brewer(type = 'div')
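An alternative to clipping the axis with `xlim` is to filter the rows before plotting, which avoids the "removed rows" warning. The sketch below works on a copy named `d` so it does not touch the original data; `d`, `cutoff`, and `trimmed` are illustrative names, and dropping zero-volume rows is an extra assumption of mine.

```r
library(ggplot2)

d <- diamonds
d$volume <- d$x * d$y * d$z

# 99th percentile of volume; rows above it are the top 1% to omit
cutoff <- quantile(d$volume, 0.99)
trimmed <- subset(d, volume > 0 & volume <= cutoff)

p <- ggplot(aes(x = volume, y = price), data = trimmed) +
  geom_point(aes(color = clarity)) +
  scale_y_log10() +
  scale_color_brewer(type = 'div')

print(p)
```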


5. Exercise: Proportion of Friendships Initiated (using ifelse)

# Your task is to create a new variable called 'prop_initiated'
# in the Pseudo-Facebook data set. The variable should contain
# the proportion of friendships that the user initiated.
pf$prop_initiated <- ifelse(pf$friend_count>0,pf$friendships_initiated / pf$friend_count,0)
summary(pf$prop_initiated)
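To see why the `ifelse` guard matters, here is a small sketch on made-up data (the `toy` data frame is hypothetical and only mimics the two `pf` columns used above). Without the guard, a user with zero friends produces 0/0, which R evaluates to NaN.

```r
# Hypothetical three-user data set; the second user has no friends
toy <- data.frame(friend_count = c(4, 0, 2),
                  friendships_initiated = c(2, 0, 2))

# Unguarded division: the second element is NaN because 0 / 0 is undefined
raw_ratio <- toy$friendships_initiated / toy$friend_count

# Guarded version: fall back to 0 when friend_count is 0
toy$prop_initiated <- ifelse(toy$friend_count > 0, raw_ratio, 0)
print(toy)
```

Note that replacing NaN with 0 pulls summary statistics such as the median downward; leaving the value as NA is another defensible choice.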


6. Exercise: prop_initiated vs. Tenure

# Create a line graph of the median proportion of
# friendships initiated ('prop_initiated') vs.
# tenure and color the line segment by
# year_joined.bucket.

# Recall, we created year_joined.bucket in Lesson 5
# by first creating year_joined from the variable tenure.
# Then, we used the cut function on year_joined to create
# four bins or cohorts of users.

# (2004, 2009]
# (2009, 2011]
# (2011, 2012]
# (2012, 2014]

# The plot should look something like this.
# http://i.imgur.com/vNjPtDh.jpg
# OR this
# http://i.imgur.com/IBN1ufQ.jpg
library("dplyr")

# ① Group the data by tenure
tenure_groups <- group_by(subset(pf,!is.na(tenure)), tenure) 

# ② Summarise the tenure_groups data set; note: do not use `pf$prop_initiated` here.
pf.fc_by_tenure <- summarise(tenure_groups,
                             median_prop = median(prop_initiated),
                             n=n())

# Compute the year joined from tenure (in days)
pf.fc_by_tenure$year_joined <- 2014 - ceiling(pf.fc_by_tenure$tenure / 365)

# ③ Cut the data into buckets
pf.fc_by_tenure$year_joined.bucket <- cut(pf.fc_by_tenure$year_joined,breaks = c(2004,2009,2011,2012,2014))


ggplot(aes(x=tenure,y=median_prop),data=pf.fc_by_tenure) + 
  geom_line(aes(color=year_joined.bucket)) +
  scale_x_continuous(breaks = seq(0, 3500, 500)) +
  theme(legend.text=element_text(size=10),legend.title=element_text(size=10)) + theme(legend.position="top")


  • ① Group the data by tenure. Compared with the original pf data, the grouped data carries a few extra attributes on top of pf:
tenure_groups <- group_by(subset(pf,!is.na(tenure)), tenure) 


  • ② Summarise the tenure_groups data set

For prop_initiated, see the previous exercise.

pf$prop_initiated <- ifelse(pf$friend_count>0,pf$friendships_initiated / pf$friend_count,0)
pf.fc_by_tenure <- summarise(tenure_groups,
                             median_prop = median(prop_initiated),
                             n=n())

head(pf.fc_by_tenure,1000)
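The warning about `pf$prop_initiated` comes down to scoping: inside `summarise`, a bare column name is evaluated within each group, whereas `pf$prop_initiated` always refers to the whole column, so every group would get the same median. A minimal sketch of the same steps written as a dplyr pipeline, using a made-up `toy` data frame instead of `pf`:

```r
library(dplyr)

# Hypothetical data: two tenure values plus one row with missing tenure
toy <- data.frame(tenure = c(1, 1, 1, 2, 2, NA),
                  prop_initiated = c(0.2, 0.4, 0.9, 0.5, 0.7, 0.3))

by_tenure <- toy %>%
  filter(!is.na(tenure)) %>%          # same role as subset(pf, !is.na(tenure))
  group_by(tenure) %>%
  summarise(median_prop = median(prop_initiated),  # per-group median
            n = n())                               # rows per group

print(by_tenure)
```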


  • ③ Cut the data into buckets

After running the code below, the data structure becomes:

pf.fc_by_tenure$year_joined <- 2014 - ceiling(pf.fc_by_tenure$tenure / 365)
pf.fc_by_tenure$year_joined.bucket <- cut(pf.fc_by_tenure$year_joined,breaks = c(2004,2009,2011,2012,2014))
head(pf.fc_by_tenure,1000)
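To make the two steps concrete, here is a small sketch on hand-picked tenure values (the vector `tenure_days` is invented for illustration). By default `cut` builds intervals that are open on the left and closed on the right, so 2009 falls in the first bucket and 2004 itself falls outside every bucket.

```r
# Hypothetical tenure values in days
tenure_days <- c(1, 365, 366, 1000, 2000)

# Same formula as above: whole years elapsed, counted back from 2014
year_joined <- 2014 - ceiling(tenure_days / 365)

# Four cohorts: (2004,2009], (2009,2011], (2011,2012], (2012,2014]
bucket <- cut(year_joined, breaks = c(2004, 2009, 2011, 2012, 2014))

print(data.frame(tenure_days, year_joined, bucket))
```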


7. Smoothing prop_initiated vs. Tenure

# Smooth the last plot you created
# of prop_initiated vs tenure colored by
# year_joined.bucket. You can bin together ranges
# of tenure or add a smoother to the plot.

Using the data produced in the previous section, the following code yields a smoothed line:

ggplot(aes(x=tenure,y=median_prop),data=pf.fc_by_tenure) + 
  scale_x_continuous(breaks = seq(0, 3500, 500)) +
  theme(legend.text=element_text(size=10),legend.title=element_text(size=10)) + theme(legend.position="top") +
  geom_smooth(aes(color = year_joined.bucket))
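The exercise also allows the other route, binning ranges of tenure together instead of adding a smoother. A rough sketch of that idea in base R, on invented numbers; the 7-day bin width and the names `toy`, `tenure_bin`, and `binned` are my own assumptions:

```r
# Hypothetical per-tenure medians, standing in for pf.fc_by_tenure
toy <- data.frame(tenure = c(0, 3, 4, 10, 11, 17),
                  median_prop = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4))

# Snap each tenure value to the nearest multiple of 7 days
toy$tenure_bin <- 7 * round(toy$tenure / 7)

# Average the medians that land in the same bin
binned <- aggregate(median_prop ~ tenure_bin, data = toy, FUN = mean)
print(binned)
```

Plotting `median_prop` against `tenure_bin` with `geom_line` should then give a less noisy curve; wider bins smooth more but hide more detail.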


10. Price/Carat Binned, Faceted, & Colored

# Create a scatter plot of the price/carat ratio
# of diamonds. The variable x should be
# assigned to cut. The points should be colored
# by diamond color, and the plot should be
# faceted by clarity.

# Note: In the link, a color palette of type
# 'div' was used to color the histogram using
# scale_color_brewer(type = 'div')

ggplot(aes(x=cut,y=price/carat),data=diamonds) + 
  geom_point(aes(color=color)) + 
  scale_color_brewer(type = 'div') +
  facet_wrap(~clarity) +
  theme(legend.position="right") 
ggsave("mtcars.png")
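Because cut is categorical, `geom_point` stacks every diamond of a given cut on one vertical line, which hides most of the points. One possible remedy, sketched below, is to jitter horizontally only; the `width` and `alpha` values are guesses to tune by eye, not values from the course.

```r
library(ggplot2)

p <- ggplot(aes(x = cut, y = price / carat), data = diamonds) +
  # Spread points sideways within each cut; height = 0 leaves price/carat untouched
  geom_jitter(aes(color = color), width = 0.3, height = 0, alpha = 0.5) +
  scale_color_brewer(type = 'div') +
  facet_wrap(~clarity)

print(p)
```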


11. Gapminder Multivariate Analysis