Haste makes waste

Uda-DataAnalysis-22-[Extension]-free R tutorial by datacamp

Posted on By lijun

Following Quick-R's recommendation, I started with the free R tutorial by DataCamp.

For my notes, see the notebook.

1. Intro to basics

1.1 Arithmetic with R

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^
  • Modulo (remainder): %%, e.g. 5 %% 3 is 2.
  • Integer division: %/%, e.g. 7 %/% 3 is 2.
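
As a quick check of the operators listed above, here is a small script-style sketch in plain R. The variable names x, y, and rebuilt are my own and do not come from the tutorial.

```r
# Each operator from the list, applied to small numbers
5 + 3    # addition
5 - 3    # subtraction
5 * 3    # multiplication
5 / 3    # division (returns a decimal)
2 ^ 5    # exponentiation
5 %% 3   # modulo: remainder of 5 divided by 3
7 %/% 3  # integer division: quotient without the remainder

# Modulo and integer division fit together:
# (x %/% y) * y + (x %% y) gives x back.
x <- 7
y <- 3
rebuilt <- (x %/% y) * y + (x %% y)
rebuilt
```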

1.2 Variable assignment

Use <- to assign a value, e.g. my_var <- 4.

1.3 Basic data types in R

  • Decimal values like 4.5 are called numerics.
  • Natural numbers like 4 are called integers. Integers are also numerics.
  • Boolean values (TRUE or FALSE) are called logical.
  • Text (or string) values are called characters.
> # Declare variables of different types
> my_numeric <- 42
> my_character <- "universe"
> my_logical <- FALSE
> 
> # Check class of my_numeric
> class(my_numeric)
[1] "numeric"

> # Check class of my_character
> class(my_character)
[1] "character"
 
> # Check class of my_logical
> class(my_logical)
[1] "logical"
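
One detail worth checking for yourself: a number typed without a suffix is stored as a double, so class() reports "numeric" even for 42. The sketch below uses the L suffix to request a true integer; the variable names are my own.

```r
# A bare number is stored as a double, so its class is "numeric"
class_plain <- class(4)

# The L suffix asks R for an integer
class_int <- class(4L)

# Both still count as numeric in the broader sense
both_numeric <- c(is.numeric(4), is.numeric(4L))

class_plain   # "numeric"
class_int     # "integer"
both_numeric  # TRUE TRUE
```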
 

2. Vectors

2.1 Create a vector

Vectors are one-dimension arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data.

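Vectors are created with the combine function c(), as in the examples below. Note that a vector holds a single data type: if you mix types, R coerces every element to a common type, as the last example shows (everything becomes character):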

> numeric_vector <- c(1, 10, 49)
> numeric_vector
[1]  1 10 49
> 
> character_vector <- c("a", "b", "c")
> character_vector
[1] "a" "b" "c"
> 

> boolean_vector <- c(TRUE,"a",TRUE)
> boolean_vector
[1] "TRUE" "a"    "TRUE"
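
To see the coercion more clearly, here is a minimal sketch with two mixed vectors. The names logical_and_number and number_and_text are hypothetical, chosen only for this illustration.

```r
# Mixing a logical with a number: TRUE is converted to 1
logical_and_number <- c(TRUE, 2)

# Mixing a number with text: 1 is converted to "1"
number_and_text <- c(1, "a")

class(logical_and_number)  # "numeric"
class(number_and_text)     # "character"
```

The general pattern is that R picks the most flexible type present, with character being the most flexible of the three basic types used here.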

2.2 Naming a vector

Naming a vector attaches a label to each element; use the names() function, e.g. names(roulette_vector).

> # Poker winnings from Monday to Friday
> poker_vector <- c(140, -50, 20, -120, 240)
> 
> # Roulette winnings from Monday to Friday
> roulette_vector <- c(-24, -50, 100, -350, 10)
> 
> # Assign days as names of poker_vector
> names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> poker_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      140       -50        20      -120       240
> 
> # Assign days as names of roulette_vectors
> names(roulette_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> roulette_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      -24       -50       100      -350        10

You can also store the labels in a vector of their own and assign that variable to names() of an existing vector:

> # The variable days_vector
> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> 
> # Assign the names of the day to roulette_vector and poker_vector
> names(poker_vector) <-   days_vector
> names(roulette_vector) <- days_vector

2.3 Vector arithmetic

  • Element-wise arithmetic on vectors:
> A_vector <- c(1, 2, 3)
> B_vector <- c(4, 5, 6)
> 
> # Take the sum of A_vector and B_vector
> total_vector <- A_vector + B_vector
> 
> # Print out total_vector
> total_vector
[1] 5 7 9
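  • If the lengths differ, R recycles the shorter vector. When the longer length is not a multiple of the shorter one, a warning is issued, but the result is still computed: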
> A_vector <- c(1, 2, 3,4,5)
> B_vector <- c(4, 5, 6)
 
> # Take the sum of A_vector and B_vector
> total_vector <- A_vector + B_vector
Warning message:
In A_vector + B_vector :
  longer object length is not a multiple of shorter object length

> # Print out total_vector
> total_vector
[1]  5  7  9  8 10
  • If the vectors in a calculation have names, the result inherits them:
> # Poker and roulette winnings from Monday to Friday:
> poker_vector <- c(140, -50, 20, -120, 240)
> roulette_vector <- c(-24, -50, 100, -350, 10)
> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> names(poker_vector) <- days_vector
> names(roulette_vector) <- days_vector
 
> # Assign to total_daily how much you won/lost on each day
> total_daily <- poker_vector + roulette_vector
> total_daily
   Monday   Tuesday Wednesday  Thursday    Friday 
      116      -100       120      -470       250
> 
  • sum() adds up all elements of a vector:
> # Poker and roulette winnings from Monday to Friday:
> poker_vector <- c(140, -50, 20, -120, 240)
> roulette_vector <- c(-24, -50, 100, -350, 10)
> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> names(poker_vector) <- days_vector
> names(roulette_vector) <- days_vector
 
> # Total winnings with poker
> total_poker <- sum(poker_vector)
 
> # Total winnings with roulette
> total_roulette <-  sum(roulette_vector)
 
> # Total winnings overall
> total_week <- total_poker + total_roulette
 
> # Print out total_week
> total_week
[1] -84

  • mean() computes the average, e.g. mean(poker_vector).
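
The unequal-length example in this section relies on R's recycling rule. The sketch below, with variable names of my own choosing, shows the rule in a case without a warning, in the common scalar case, and in the 5-versus-3 case from the notes (the warning is suppressed here only to keep the output clean).

```r
# Lengths 6 and 2: the shorter vector is reused three times, no warning
long_vector <- c(1, 2, 3, 4, 5, 6)
short_vector <- c(10, 20)
recycled_sum <- long_vector + short_vector
recycled_sum  # 11 22 13 24 15 26

# A single number is a vector of length 1, so it is recycled across every element
doubled <- c(1, 2, 3) * 2
doubled  # 2 4 6

# Lengths 5 and 3: the shorter vector is extended to 4 5 6 4 5,
# which is why the result ends in 8 (4 + 4) and 10 (5 + 5)
uneven_sum <- suppressWarnings(c(1, 2, 3, 4, 5) + c(4, 5, 6))
uneven_sum  # 5 7 9 8 10
```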

2.4 Selecting vector elements by index

Indexing may surprise you if you come from other languages: it starts at 1, not 0.

  • To select the first element of the vector, you type poker_vector[1].
  • To select the second element of the vector, you type poker_vector[2].
  • To select the first and fifth days, index with the vector c(1, 5): poker_vector[c(1, 5)].
  • To select the first through fifth days, use a range: poker_vector[1:5].
  • You can also select by name: poker_vector[c("Monday","Tuesday")].
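
The selection forms above can be tried together in one short script. This is only a sketch reusing poker_vector from the earlier sections; the result names (first_day, midweek, and so on) are my own.

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Indexing is 1-based: [1] is the first element
first_day <- poker_vector[1]

# A vector of positions selects several elements at once
first_and_last <- poker_vector[c(1, 5)]

# The colon operator builds a range of positions
midweek <- poker_vector[2:4]

# Names work as indices too
by_name <- poker_vector[c("Monday", "Tuesday")]

first_day
first_and_last
midweek
by_name
```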

2.5 Selecting by comparison

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other
> c(4, 5, 6) > 5
[1] FALSE FALSE  TRUE

The resulting logical vector can be used as a selector on a vector. This works much like boolean indexing in Python's pandas library.

> poker_vector <- c(140, -50, 20, -120, 240)
> roulette_vector <- c(-24, -50, 100, -350, 10)
> days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
> names(poker_vector) <- days_vector
> names(roulette_vector) <- days_vector
> 
> selection_vector <- poker_vector > 0
> 
> poker_winning_days <- poker_vector[selection_vector]
> poker_winning_days
   Monday Wednesday    Friday 
      140        20       240
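
The intermediate selection_vector is optional. As a small sketch, the comparison can go straight inside the brackets, and because TRUE counts as 1, sum() on the comparison counts the matches. The names winning_days and n_winning are my own.

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Comparison written directly inside the brackets
winning_days <- poker_vector[poker_vector > 0]

# Number of winning days: each TRUE contributes 1 to the sum
n_winning <- sum(poker_vector > 0)

winning_days
n_winning  # 3
```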

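3. Matrices

In R, a matrix is a collection of elements of the same data type arranged in a fixed number of rows and columns; it is two-dimensional.

Create one with the matrix() function. For example, in matrix(1:9, byrow = TRUE, nrow = 3):

  1. The first argument is the data used to fill the matrix; c(1,2,3,4,5,6,7,8,9) is equivalent to 1:9.
  2. The argument byrow controls the fill direction: with TRUE the matrix is filled row by row, otherwise column by column.
  3. The argument nrow gives the number of rows.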
> # Construct a matrix with 3 rows that contain the numbers 1 up to 9
> matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

> # Construct a matrix with 3 rows that contain the numbers 1 up to 9
> matrix(1:9, byrow = FALSE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

3.1 Analyzing matrices

> # Box office Star Wars (in millions!)
> new_hope <- c(460.998, 314.4)
> empire_strikes <- c(290.475, 247.900)
> return_jedi <- c(309.306, 165.8)
 
> # Create box_office
> box_office <- c(new_hope,empire_strikes,return_jedi)
 
> # Construct star_wars_matrix
> star_wars_matrix <- matrix(box_office,byrow=TRUE,nrow=3)
> star_wars_matrix
        [,1]  [,2]
[1,] 460.998 314.4
[2,] 290.475 247.9
[3,] 309.306 165.8

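In the example above, c(new_hope, empire_strikes, return_jedi) first combines the three small vectors into one, and that combined vector is then used to build the matrix.

3.2 Naming a matrix

As with vectors, you can name a matrix, using the functions rownames() and colnames().

> # Box office Star Wars (in millions!)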
> new_hope <- c(460.998, 314.4)
> empire_strikes <- c(290.475, 247.900)
> return_jedi <- c(309.306, 165.8)
> 
> # Construct matrix
> star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)
> 
> # Vectors region and titles, used for naming
> region <- c("US", "non-US")
> titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
> 
> # Name the columns with region
> colnames(star_wars_matrix) <- region
> 
> # Name the rows with titles
> rownames(star_wars_matrix) <- titles
> 
> # Print out star_wars_matrix
> star_wars_matrix
                             US non-US
A New Hope              460.998  314.4
The Empire Strikes Back 290.475  247.9
Return of the Jedi      309.306  165.8

3.3 Calculations

  • rowSums() sums each row. Note that the matrix() call below names the rows and columns at construction time through dimnames. rowSums() then computes the total for each row, i.e. the worldwide box office of each film.
> # Construct star_wars_matrix
> box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)

> star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                             dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                             c("US", "non-US")))
 
> star_wars_matrix
                             US non-US
A New Hope              460.998  314.4
The Empire Strikes Back 290.475  247.9
Return of the Jedi      309.306  165.8
 
> # Calculate worldwide box office figures
> worldwide_vector <- rowSums(star_wars_matrix)
> worldwide_vector
            A New Hope The Empire Strikes Back      Return of the Jedi 
              775.398                 538.375                 475.106
> 

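3.4 Combining matrices

  • cbind() adds columns

Use cbind() to join matrices and vectors by column. rowSums() yields one total per row, so binding it to the matrix adds a new column (the sum) to every row.

> # Construct star_wars_matrix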
> box_office <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
> star_wars_matrix <- matrix(box_office, nrow = 3, byrow = TRUE,
                             dimnames = list(c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"), 
                                             c("US", "non-US")))
> 
> # The worldwide box office figures
> worldwide_vector <- rowSums(star_wars_matrix)
> worldwide_vector
             A New Hope The Empire Strikes Back      Return of the Jedi 
                775.398                 538.375                 475.106
> 
> # Bind the new variable worldwide_vector as a column to star_wars_matrix
> all_wars_matrix <- cbind(star_wars_matrix,worldwide_vector)
> all_wars_matrix
                             US non-US worldwide_vector
A New Hope              460.998  314.4          775.398
The Empire Strikes Back 290.475  247.9          538.375
Return of the Jedi      309.306  165.8          475.106
  • rbind() adds rows
> # star_wars_matrix and star_wars_matrix2 are available in your workspace
> star_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
> star_wars_matrix2
                        US non-US
The Phantom Menace   474.5  552.5
Attack of the Clones 310.7  338.7
Revenge of the Sith  380.3  468.5
> 
> # cbind() places the two trilogies side by side, which is not what we want here
> all_wars_matrix <- cbind(star_wars_matrix,star_wars_matrix2)
> all_wars_matrix
                           US non-US    US non-US
A New Hope              461.0  314.4 474.5  552.5
The Empire Strikes Back 290.5  247.9 310.7  338.7
Return of the Jedi      309.3  165.8 380.3  468.5

> # Combine both Star Wars trilogies in one matrix by stacking rows with rbind()
> all_wars_matrix <- rbind(star_wars_matrix,star_wars_matrix2)
> all_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
  • colSums() sums each column
> # all_wars_matrix is available in your workspace
> all_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
> 
> # Total revenue for US and non-US
> total_revenue_vector <- colSums(all_wars_matrix)
> 
> # Print out total_revenue_vector
> total_revenue_vector
    US non-US 
2226.3 2087.8
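
To see the four functions of sections 3.3 and 3.4 side by side, here is a compact sketch on a tiny matrix. The matrix m and the result names are hypothetical and chosen only for illustration.

```r
# A 2 x 3 matrix filled row by row:
#   1 2 3
#   4 5 6
m <- matrix(1:6, nrow = 2, byrow = TRUE)

row_totals <- rowSums(m)  # one total per row: 6 15
col_totals <- colSums(m)  # one total per column: 5 7 9

# cbind() appends the row totals as a fourth column
with_total_col <- cbind(m, total = row_totals)

# rbind() appends the column totals as a third row
with_total_row <- rbind(m, col_totals)

dim(with_total_col)  # 2 4
dim(with_total_row)  # 3 3
```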

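3.5 Selecting matrix elements

  • my_matrix[1,2] selects the element in row 1, column 2.
  • my_matrix[1:3,2:4] selects rows 1, 2, 3 of columns 2, 3, 4.
  • my_matrix[,1] selects the first column (all rows).
  • my_matrix[1,] selects the first row (all columns).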
> # all_wars_matrix is available in your workspace
> all_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
> 
> # Select the non-US revenue for all movies
> non_us_all <- all_wars_matrix[,2]
> 
> # Average non-US revenue
> mean(non_us_all)
[1] 347.9667
> 
> # Select the non-US revenue for first two movies
> non_us_some <- all_wars_matrix[1:2,2]
> 
> # Average non-US revenue for first two movies
> mean(non_us_some)
[1] 281.15
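
The four my_matrix forms listed at the start of this section can be checked on a small example. In this sketch my_matrix is a hypothetical 3 x 4 matrix of my own, not data from the tutorial.

```r
# A 3 x 4 matrix filled row by row:
#   1  2  3  4
#   5  6  7  8
#   9 10 11 12
my_matrix <- matrix(1:12, nrow = 3, byrow = TRUE)

single <- my_matrix[1, 2]      # row 1, column 2
block <- my_matrix[1:3, 2:4]   # rows 1 to 3 of columns 2 to 4
first_col <- my_matrix[, 1]    # every row of column 1
first_row <- my_matrix[1, ]    # every column of row 1

single     # 2
first_col  # 1 5 9
first_row  # 1 2 3 4
dim(block) # 3 3
```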

3.6 Arithmetic with matrices

The standard operators +, -, /, and * also work on matrices and are applied element by element.

> # all_wars_matrix and ticket_prices_matrix are available in your workspace
> all_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
> ticket_prices_matrix
                         US non-US
A New Hope              5.0    5.0
The Empire Strikes Back 6.0    6.0
Return of the Jedi      7.0    7.0
The Phantom Menace      4.0    4.0
Attack of the Clones    4.5    4.5
Revenge of the Sith     4.9    4.9
> 
> # Estimated number of visitors
> visitors <- all_wars_matrix / ticket_prices_matrix
> 
> # US visitors
> us_visitors <- all_wars_matrix[,1]/ticket_prices_matrix[,1]
> 
> # Average number of US visitors
> mean(us_visitors)
[1] 75.01401
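
Because * works element by element, it is not the matrix product from linear algebra; R uses the separate operator %*% for that. The tutorial section above does not cover %*%, so treat this as a side note; the matrices a and b are made up for the sketch.

```r
# a is    1 3     b is    2 0
#         2 4             0 2
a <- matrix(1:4, nrow = 2)
b <- matrix(c(2, 0, 0, 2), nrow = 2)

# Element-wise: each entry of a times the entry of b in the same position
elementwise <- a * b

# Matrix product: rows of a combined with columns of b
matrix_product <- a %*% b

elementwise     # rows: 2 0 / 0 8
matrix_product  # rows: 2 6 / 4 8
```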

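4. Factors

A factor is a statistical data type for storing categorical variables. A categorical variable takes values from a limited set of categories, such as gender, whereas a continuous variable can take infinitely many values.

4.1 factor()

Use factor() to create a factor. In the example below the factor levels are Female and Male.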

> # Gender vector
> gender_vector <- c("Male", "Female", "Female", "Male", "Male")
> 
> # Convert gender_vector to a factor
> factor_gender_vector <-factor(gender_vector)
> 
> # Print out factor_gender_vector
> factor_gender_vector
[1] Male   Female Female Male   Male  
Levels: Female Male

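Categorical variables come in two kinds:

  • nominal categorical variables, e.g. orangutan, elephant, crocodile: the categories have no rank or order.
  • ordinal categorical variables, e.g. large, medium, small: the categories have a natural order.

4.2 Factor levels

levels() lets you rename the levels of a factor. Mind the order: the new names are assigned by position to the existing levels, which are sorted alphabetically by default (here F, then M).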

> # Code to build factor_survey_vector
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
> factor_survey_vector
[1] M F F M M
Levels: F M
> 
> # Specify the levels of factor_survey_vector
> levels(factor_survey_vector) <-c("Female", "Male")
> 
> factor_survey_vector
[1] Male   Female Female Male   Male  
Levels: Female Male
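
Since renaming through levels() depends on getting the positions right, one alternative is to spell out the mapping when the factor is created, using the levels and labels arguments of factor(). This is a sketch of that alternative, not the approach the tutorial exercise uses.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# levels lists the raw values, labels gives the display name for each,
# so the pairing F -> Female, M -> Male is explicit
factor_survey_vector <- factor(survey_vector,
                               levels = c("F", "M"),
                               labels = c("Female", "Male"))

factor_survey_vector
levels(factor_survey_vector)  # "Female" "Male"
```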

4.3 The summary() function

> # Build factor_survey_vector with clean levels
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
> levels(factor_survey_vector) <- c("Female", "Male")
> factor_survey_vector
[1] Male   Female Female Male   Male  
Levels: Female Male
> 
> # Generate summary for survey_vector
> summary(survey_vector)
   Length     Class      Mode 
        5 character character
> 
> # Generate summary for factor_survey_vector
> summary(factor_survey_vector)
Female   Male 
     2      3

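4.4 Comparing factors

Subsetting a factor by index returns a factor with the same levels. Unordered (nominal) factors cannot be ranked against each other: an ordering comparison such as > returns NA with a warning.

> survey_vector <- c("M", "F", "F", "M", "M","N")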
> factor_survey_vector <- factor(survey_vector)
> levels(factor_survey_vector) <- c("Female", "Male","Newhalf")
> 
> # Male
> male <- factor_survey_vector[1]
> male
[1] Male
Levels: Female Male Newhalf
> 
> # Female
> female <- factor_survey_vector[2]
> female
[1] Female
Levels: Female Male Newhalf
> 
> # Battle of the sexes: Male 'larger' than female?
> male > female
Warning message: '>' not meaningful for factors
[1] NA
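
Only the ordering operators are undefined for unordered factors; testing for equality with == still works. A minimal sketch, with a simplified two-level factor and result names of my own:

```r
gender_factor <- factor(c("Male", "Female", "Female", "Male", "Male"))

male <- gender_factor[1]
female <- gender_factor[2]

# Equality between two factor values with the same level set
same_category <- male == female

# Equality against a plain string, element by element
is_male <- gender_factor == "Male"

same_category  # FALSE
is_male        # TRUE FALSE FALSE TRUE TRUE
```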

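4.5 Ordered factors

As noted earlier, factors come in two kinds: those without a ranking and those with one. By default factor() builds the former (unordered). To build an ordered factor, pass these arguments: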

factor(some_vector,
       ordered = TRUE,
       levels = c("lev1", "lev2" ...))
> # Create speed_vector
> speed_vector <- c("fast", "slow", "slow", "fast", "insane")
> 
> # Convert speed_vector to ordered factor vector
> factor_speed_vector <- factor(speed_vector,ordered = TRUE,levels=c("slow","fast","insane"))
> 
> # Print factor_speed_vector
> factor_speed_vector
[1] fast   slow   slow   fast   insane
Levels: slow < fast < insane
> summary(factor_speed_vector)
  slow   fast insane 
     2      2      1

4.6 Comparing ordered factors

The order of an ordered factor is fixed when it is defined, so comparison operators can be used on it.

> # Create factor_speed_vector
> speed_vector <- c("fast", "slow", "slow", "fast", "insane")
> factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))
> 
> # Factor value for second data analyst
> da2 <- factor_speed_vector[2]
> 
> # Factor value for fifth data analyst
> da5 <- factor_speed_vector[5]
> 
> # Is data analyst 2 faster than data analyst 5?
> da2 > da5
[1] FALSE
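
Comparisons on an ordered factor also work against a level written as a string, which makes it possible to filter by rank. The sketch below reuses the speed example; the names da2_slower and at_least_fast are my own.

```r
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
factor_speed_vector <- factor(speed_vector, ordered = TRUE,
                              levels = c("slow", "fast", "insane"))

# Analyst 2 ("slow") ranks below analyst 5 ("insane")
da2_slower <- factor_speed_vector[2] < factor_speed_vector[5]

# Compare every element against the level "fast"
at_least_fast <- factor_speed_vector >= "fast"

da2_slower     # TRUE
at_least_fast  # TRUE FALSE FALSE TRUE TRUE
```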

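5. Data frames

A data frame resembles a two-dimensional matrix, but its columns can hold different data types.

  • head() and tail() return the first and last rows of the data, respectively (six by default).

> mtcars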
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
> 
  • str() also gives a compact overview of a data frame's structure
> # Investigate the structure of mtcars
> str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
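
head() and tail() both accept an n argument when six rows is not what you want, and dim() gives the overall size without printing anything else. A small sketch on the built-in mtcars data, with result names of my own:

```r
# First three and last two rows instead of the default six
first_three <- head(mtcars, n = 3)
last_two <- tail(mtcars, n = 2)

# Number of rows and columns
dims <- dim(mtcars)

rownames(first_three)  # "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
rownames(last_two)     # "Maserati Bora" "Volvo 142E"
dims                   # 32 11
```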

5.1 Creating a data frame with data.frame()

> # Definition of vectors
> name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
> type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", 
            "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
> diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
> rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
> rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
 
> # Create a data frame from the vectors
> planets_df <- data.frame(name,type,diameter,rotation,rings)
> planets_df
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE

> # Check the structure of planets_df
> str(planets_df)
'data.frame':	8 obs. of  5 variables:
 $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
 $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
 $ diameter: num  0.382 0.949 1 0.532 11.209 ...
 $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
 $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...

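The Factor columns in the str() output above reflect R versions before 4.0.0, where data.frame() converted strings to factors by default; newer versions keep them as chr unless stringsAsFactors = TRUE is set.

Elements are selected with the same index notation as matrices: for example, planets_df[1:5,"diameter"] returns rows 1 to 5 of the diameter column.

A shorter way to get a whole column is the $ operator: planets_df$rings returns the rings column.

A logical vector can be used to keep the rows whose entry is TRUE:

> # Select all columns for planets with rings
> rings_vector <- planets_df$rings
> rings_vector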
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> planets_df[rings_vector, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
> 
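
The three selection styles from this section, gathered into one runnable sketch. To keep it short I rebuild planets_df with only three of its columns; everything else follows the notes above.

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Matrix-style indexing: rows 1 to 5 of the diameter column
first_five <- planets_df[1:5, "diameter"]

# $ pulls out a whole column as a vector
rings_vector <- planets_df$rings

# A logical vector in the row position keeps the TRUE rows
with_rings <- planets_df[rings_vector, ]

first_five
with_rings$name  # "Jupiter" "Saturn" "Uranus" "Neptune"
```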

5.2 Getting a subset of a data frame with subset()

> subset(planets_df, subset = diameter < 1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
> 

This returns the rows whose diameter is less than 1.
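
subset() is a convenience over logical indexing. As a sketch, the same rows can be selected with a logical vector in the row position; note that inside the brackets the column has to be written as planets_df$diameter, while subset() lets you write diameter on its own. The reduced planets_df here is my own shortened copy.

```r
planets_df <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1, 0.532),
  stringsAsFactors = FALSE
)

# With subset(): column names can be used directly in the condition
small_subset <- subset(planets_df, subset = diameter < 1)

# With logical indexing: the condition goes in the row position
small_index <- planets_df[planets_df$diameter < 1, ]

small_subset$name  # "Mercury" "Venus" "Mars"
small_index$name   # "Mercury" "Venus" "Mars"
```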

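5.3 Sorting

order() returns the positions that put its input in ascending order. The example below is numeric; character vectors are ordered alphabetically.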

> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3
> a[order(a)]
[1]   10  100 1000

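The above sorts a one-dimensional vector; below, a two-dimensional data frame is sorted. Note how the indexing differs from the vector case: the positions go in the row slot, followed by a comma.

> # planets_df is pre-loaded in your workspace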
> 
> # Use order() to create positions
> positions <-  order(planets_df$diameter)
> positions
[1] 1 4 2 3 8 7 6 5
> 
> # Use positions to sort planets_df
> planets_df[positions,]
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
8 Neptune          Gas giant    3.883     0.67  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
5 Jupiter          Gas giant   11.209     0.41  TRUE
> 
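
order() sorts ascending by default; passing decreasing = TRUE reverses that. The sketch below shows it on the small vector a, on a character vector, and on a shortened, hypothetical copy of planets_df.

```r
a <- c(100, 10, 1000)

# Positions that sort a from largest to smallest
desc_positions <- order(a, decreasing = TRUE)
desc_positions     # 3 1 2
a[desc_positions]  # 1000 100 10

# Character vectors are ordered alphabetically
letters_positions <- order(c("b", "a", "c"))
letters_positions  # 2 1 3

# Same idea for a data frame: positions go in the row slot
planets_df <- data.frame(
  name = c("Mercury", "Earth", "Jupiter"),
  diameter = c(0.382, 1, 11.209),
  stringsAsFactors = FALSE
)
largest_first <- planets_df[order(planets_df$diameter, decreasing = TRUE), ]
largest_first$name  # "Jupiter" "Earth" "Mercury"
```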

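6. Lists

First, a recap of the data structures covered so far:

  • vectors: can store numeric, character, or logical values; all elements of a vector share the same data type
  • matrices: two-dimensional, and also limited to a single data type
  • data frames: two-dimensional; values within a column share a data type, but different columns can have different types

6.1 Creating a list with list()

A list can hold components of entirely different data structures.

> # Vector with numerics from 1 up to 10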
> my_vector <- 1:10
> 
> # Matrix with numerics from 1 up to 9
> my_matrix <- matrix(1:9, ncol = 3)
> 
> # First 10 elements of the built-in data frame mtcars
> my_df <- mtcars[1:10,]
> 
> # Construct list with these different elements:
> my_list <- list(my_vector,my_matrix,my_df)
> my_list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[3]]
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

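6.2 Naming list components

There are two ways:

# Name the components when creating the list
my_list <- list(name1 = your_comp1, 
                name2 = your_comp2)

# Name the components after creating the list
my_list <- list(your_comp1, your_comp2)
names(my_list) <- c("name1", "name2")

> # Build shining_list (mov, act, and rev come from the exercise workspace)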
> shining_list <- list(moviename = mov,actors=act,reviews=rev)
> shining_list
$moviename
[1] "The Shining"

$actors
[1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
[5] "Barry Nelson"    

$reviews
  scores sources                                              comments
1    4.5   IMDb1                     Best Horror Film I Have Ever Seen
2    4.0   IMDb2 A truly brilliant and scary film from Stanley Kubrick
3    5.0   IMDb3                 A masterpiece of psychological horror

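6.3 Selecting elements from a list

For example, to get the reviews component above, use shining_list[["reviews"]] or shining_list$reviews. To get one specific element, shining_list[[2]][1] returns "Jack Nicholson", i.e. the first element of the list's second component.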
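
One point that is easy to trip over: single brackets on a list return a smaller list, while double brackets (or $) return the component itself. The sketch uses a cut-down shining_list with two actors only; the result names are my own.

```r
shining_list <- list(moviename = "The Shining",
                     actors = c("Jack Nicholson", "Shelley Duvall"))

# Single brackets: still a list, of length 1
as_list <- shining_list["actors"]

# Double brackets and $: the character vector stored inside
as_vector <- shining_list[["actors"]]
same_vector <- shining_list$actors

# Position 2 of the list, then element 1 of that vector
first_actor <- shining_list[[2]][1]

class(as_list)    # "list"
class(as_vector)  # "character"
first_actor       # "Jack Nicholson"
```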

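6.4 Adding movie information to a list

ext_list <- c(my_list, my_name = my_val)

This appends a new component my_val, named my_name, to my_list and produces the new list ext_list.

The example below adds a component named year with the value 1980 to shining_list: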

> shining_list_full <- c(shining_list,year=1980)
> 
> # Have a look at shining_list_full
> str(shining_list_full)
List of 4
 $ moviename: chr "The Shining"
 $ actors   : chr [1:5] "Jack Nicholson" "Shelley Duvall" "Danny Lloyd" "Scatman Crothers" ...
 $ reviews  :'data.frame':	3 obs. of  3 variables:
  ..$ scores  : num [1:3] 4.5 4 5
  ..$ sources : Factor w/ 3 levels "IMDb1","IMDb2",..: 1 2 3
  ..$ comments: Factor w/ 3 levels "A masterpiece of psychological horror",..: 3 2 1
 $ year     : num 1980
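
A couple of alternatives to c(), sketched on a cut-down shining_list. Assigning with $ modifies a copy in place of building a new list, and wrapping a multi-element value in list() keeps it as one component; without the wrapper, c() spreads the elements out as separate components. The variable names here are hypothetical.

```r
shining_list <- list(moviename = "The Shining")

# c() with a single value: one new component named year
with_c <- c(shining_list, year = 1980)

# Assigning through $ gives the same components
with_dollar <- shining_list
with_dollar$year <- 1980

# A vector of length 2 passed to c() becomes two separate components
spread_out <- c(shining_list, actors = c("Jack Nicholson", "Shelley Duvall"))

# Wrapping it in list() keeps it as a single component
kept_together <- c(shining_list,
                   list(actors = c("Jack Nicholson", "Shelley Duvall")))

names(with_c)          # "moviename" "year"
length(spread_out)     # 3
length(kept_together)  # 2
```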
> 
>