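1 Basic Operations in R

Data is the central object of data science. In the era of big data, anything a computer can record counts as data. Because data is ultimately numeric, it supports basic arithmetic. Basic operations are the most fundamental building block of R, so you can start by thinking of R as a calculator.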

1.1 Basic Operation

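Command      Description
+, -, *, /   addition, subtraction, multiplication, division
^            exponentiation
%%           remainder (modulo)
round()      round to the nearest value
floor()      round down
ceiling()    round up
abs()        absolute value
sqrt()       square root
log()        logarithm (natural logarithm by default)
exp()        exponential
( )          change operator precedence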

1.2 Logical Operation

Command   Description
&, |, !   and, or, not
==        test for equality

1.3 Assignment Operation

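Command   Description
=         one-way assignment (the name must be on the left)
<-        two-way assignment (can also be written as ->)
<<-       assign to a variable in the enclosing environment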

1.4 Examples

22 %% 6
## [1] 4
floor(22/6)
## [1] 3
sqrt(22/6)
## [1] 1.914854
log(22/6)
## [1] 1.299283
3>4
## [1] FALSE
F == T
## [1] FALSE
  • Exercise: try to compute 24 using four 1s

  • <- two-way assignment

asg <- 2
asg
## [1] 2
3 -> asg
asg
## [1] 3
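  • = one-way assignment
asg = 1
asg
## [1] 1
# Wrong: a value cannot be the target of an assignment (this line raises an error)
1 = asg  
  • = passing function arguments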

Usage
matrix(data = NA, nrow = 1, ncol = 1, ...)

matrix(1:6, 3)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

How do we create a 2 × 3 matrix?

GOOD

matrix(1:6, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

BAD

matrix(1:6, ncol <- 3)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

BAD: passing function arguments with <-

ls() # What does this show?

mean(x = 1:3)
## [1] 2
ls()
## [1] "asg"  "ncol"
mean(x <- 1:3)
## [1] 2
ls()
## [1] "asg"  "ncol" "x"

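> Inside a function call, = binds an argument by name, while <- performs an assignment in the calling environment and passes the value by position. The two are therefore not interchangeable: when you really do want to assign inside a call, as in system.time(A <- matrix(1:6, 3)), only <- works.

  • <<- assigns to a variable in the enclosing (parent) environment

  • Recommendation: use <- rather than = for assignment.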
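The notes list <<- without an example, so here is a minimal sketch of how it differs from <-. The helper make_counter is hypothetical and exists only for illustration; it keeps its state in the enclosing function's environment.

```r
# make_counter is a hypothetical helper, not part of the original notes
make_counter <- function() {
  count <- 0                 # lives in make_counter's environment
  function() {
    count <<- count + 1      # updates the enclosing `count`, not a new local one
    count
  }
}

counter <- make_counter()
counter()
## [1] 1
counter()
## [1] 2
n_calls <- counter()         # third call
```

Had the inner function used count <- count + 1 instead, every call would create a fresh local variable and return 1.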

2 Code Style in R

3 Data Types in R

3.1 Basic

Atomic types

  • logical

  • character

  • integer

  • numeric (double), for real numbers

  • complex

3.2 Examples

is.logical(T)
## [1] TRUE
is.logical(1)
## [1] FALSE
typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.integer(7)
## [1] FALSE
is.character("K")
## [1] TRUE
is.character(as.character("1"))
## [1] TRUE

4 Data Structures in R

4.1 Basic

Objects built from the atomic types

  • Vectors

  • Matrices

  • Data Frames

  • Lists

  • Factors

  • Dates: Date

4.2 Vector

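What is a vector? A vector is a one-dimensional array that stores data of a single type, and it is the most basic of all data structures. You can picture it as a house divided into many small rooms, each holding one element of the same type. Below we cover vectors in two parts: basic operations and common types.

4.2.1 Vectors - Creation

In general, the function c() is all you need to create a vector: just list the elements inside the parentheses. Typing every element by hand gets tedious, though, and the shortcut is to exploit patterns.

If you know the pattern your vector follows, there is probably a function made for it. For an arithmetic sequence, use seq(); for the consecutive integers from a to b, use a:b; to draw values at random from a set of numbers, use sample(); and with the string-pasting function paste() you can combine characters and numbers in a regular way (for example, to generate variable names in bulk). Some examples: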

vector(mode = "character", length = 3)
## [1] "" "" ""
v <- c(0, 1, -2)
typeof(v)
## [1] "double"
class(v)
## [1] "numeric"
length(v)
## [1] 3
a <- 1:10
a
##  [1]  1  2  3  4  5  6  7  8  9 10
b <- seq(1,10, by = 2)
b
## [1] 1 3 5 7 9
c <- c('a', 'b', 'c')
c
## [1] "a" "b" "c"
d <- sample(a, size = 20, replace = T)
d
##  [1]  8  8  3  2  1  8  8  1  7  4  7  2  4  3  5  5  6 10  8  6
e <- paste("x_", 1:5, collapse = ",")
e
## [1] "x_ 1,x_ 2,x_ 3,x_ 4,x_ 5"

4.2.2 Vectors - Type Conversion

as.character(v)
## [1] "0"  "1"  "-2"
as.logical(v)
## [1] FALSE  TRUE  TRUE
class(v)
## [1] "numeric"
class(as.logical(v))
## [1] "logical"

4.2.3 Vectors - Indexing

v[2]
## [1] 1
v[c(2, 3)]
## [1]  1 -2
v[2:3]
## [1]  1 -2
v[-2]
## [1]  0 -2
which(v==1)
## [1] 2
v[which(v < 1)]
## [1]  0 -2

4.2.4 Vectors - Naming Elements

names(v)
## NULL
names(v) <- c("Davide", "Linda", "Smale")
names(v)
## [1] "Davide" "Linda"  "Smale"
v[c("Davide", "Linda")]
## Davide  Linda 
##      0      1

4.2.5 Vectors - Arithmetic

t <- c(2, 1, 3)
names(v) <- c("Davide", "Linda", "Smale")
(v + t)/2
## Davide  Linda  Smale 
##    1.0    1.0    0.5
t > v 
## Davide  Linda  Smale 
##   TRUE  FALSE   TRUE
summary(v)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -1.0000  0.0000 -0.3333  0.5000  1.0000

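4.2.6 Vectors - Functions

Command                      Description
sum(v)                       sum of all elements of v
mean(v)                      mean of v
median(v)                    median of v
min(v), max(v)               minimum and maximum of v
sd(v), var(v)                standard deviation and variance of v
length(v)                    number of elements in v
pmax(v1, v2), pmin(v1, v2)   element-wise maximum and minimum; e.g. pmax(quiz1, quiz2) returns each student's higher score across two quizzes
sort(v)                      v in sorted order
order(v)                     indices that would sort v
unique(v)                    v with duplicates removed
summary(v)                   five-number summary plus the mean
any(v)                       is at least one element of v TRUE?
all(v)                       are all elements of v TRUE?
intersect                    set intersection
union                        set union
setdiff                      set difference
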
f <- c(1,2,3,2,1,1,1,2,3,4,5,1,3,1,2,3,1)
x <- letters[f]
x
##  [1] "a" "b" "c" "b" "a" "a" "a" "b" "c" "d" "e" "a" "c" "a" "b" "c" "a"
match(x, letters[1:5])
##  [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1
match(x, letters[1:7])
##  [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1
match(x, letters[1:3])
##  [1]  1  2  3  2  1  1  1  2  3 NA NA  1  3  1  2  3  1
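Several functions in the table above have no example in these notes. The following sketch exercises a few of them on made-up data; the vectors quiz1, quiz2 and s are illustrative assumptions.

```r
# quiz1, quiz2 and s are made-up illustration data
quiz1 <- c(80, 65, 90)
quiz2 <- c(75, 70, 95)
best <- pmax(quiz1, quiz2)      # element-wise maximum: 80 70 95

s <- c(3, 1, 2, 1)
sorted <- sort(s)               # 1 1 2 3
idx <- order(s)                 # positions that sort s: 2 4 3 1
uniq <- unique(s)               # 3 1 2

common <- intersect(c(1, 2, 3), c(2, 3, 4))    # 2 3
only_left <- setdiff(c(1, 2, 3), c(2, 3, 4))   # 1

any_big <- any(s > 2)           # TRUE: at least one element exceeds 2
all_big <- all(s > 2)           # FALSE: not every element exceeds 2
```

Note that s[order(s)] gives the same result as sort(s); order() is the more useful of the two when one vector should be sorted by another.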

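4.2.7 Vectors - Character Vectors

Character data differs from numeric data and has quirks of its own that call for dedicated functions. Even a single string has properties that a number does not: we rarely ask for the length of a number, but we often need the length of a string, as below.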

length("欢迎关注狗熊会")
## [1] 1
nchar("欢迎关注狗熊会")
## [1] 7
nchar("Welcome to follow the Clubear")
## [1] 29

Since strings have a length, we can slice them and extract a substring. The function for this is substr, used as substr("char", begin_position, end_position), for example:

substr("Welcome to follow the Clubear", 3,9)
## [1] "lcome t"
paste(1:4)
## [1] "1" "2" "3" "4"
paste(1:4, collapse = "")
## [1] "1234"
paste(1:4, sep = "")
## [1] "1" "2" "3" "4"
paste(1:4, collapse = ",")
## [1] "1,2,3,4"
paste(1:4, sep = ",")
## [1] "1" "2" "3" "4"
paste(letters[1:4], 1:4)
## [1] "a 1" "b 2" "c 3" "d 4"
paste(letters[1:4], 1:4, collapse = "_")
## [1] "a 1_b 2_c 3_d 4"
paste(letters[1:4], 1:4, sep = "_")
## [1] "a_1" "b_2" "c_3" "d_4"
paste(letters[1:4], 1:4, collapse = "_",sep = "|")
## [1] "a|1_b|2_c|3_d|4"
t <- c("狗熊会", "Clubear", "Social Science")
grep("Social", t)
## [1] 3
# Compute the mean salary
salary <- c("10w", "15w", "23w", "28w")
mean(salary)
## Warning in mean.default(salary): argument is not numeric or logical:
## returning NA
## [1] NA
  • Hint: use the gsub function
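One possible way to follow the hint is sketched below. It assumes the trailing "w" is just a unit suffix that can be dropped before converting to numbers; the variable names salary_num and mean_salary are illustrative.

```r
salary <- c("10w", "15w", "23w", "28w")

# Assumption: "w" is only a unit suffix, so strip it and convert to numeric
salary_num <- as.numeric(gsub("w", "", salary))
mean_salary <- mean(salary_num)
mean_salary
## [1] 19
```

The result is still expressed in units of "w"; multiply by the unit's value if an absolute figure is needed.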

4.3 Matrix

4.3.1 Matrices - Creation

M <- matrix(1:9, nrow = 3, ncol = 3)
print(M)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
length(M)
## [1] 9
dim(M)
## [1] 3 3
cbind(1:3, 4:6)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
rbind(1:3, 4:6)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
attributes(M)
## $dim
## [1] 3 3
I <- diag(1:4)
I
##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    2    0    0
## [3,]    0    0    3    0
## [4,]    0    0    0    4
vec <- 1:9
vec
## [1] 1 2 3 4 5 6 7 8 9
attributes(vec)
## NULL
dim(vec) <- c(3, 3)
vec
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

4.3.2 Matrices - Indexing

M[2,3]
## [1] 8
M[1, ]
## [1] 1 4 7
M[, 2]
## [1] 4 5 6

4.3.3 Matrices - Arithmetic

Q <- matrix(rep(10, 9), 3, 3)
M * Q
##      [,1] [,2] [,3]
## [1,]   10   40   70
## [2,]   20   50   80
## [3,]   30   60   90
M %*% Q
##      [,1] [,2] [,3]
## [1,]  120  120  120
## [2,]  150  150  150
## [3,]  180  180  180

4.3.4 Matrices - Functions

colMeans(M)
## [1] 2 5 8
colSums(M)
## [1]  6 15 24
colnames(M)
## NULL
rownames(M)
## NULL

4.3.5 Sparse Matrices

# install.packages("Matrix")
library(Matrix)
vectors <- c(1:4, rep(0,4), 5,0,0,9)
m <- matrix(vectors, nrow = 4)
print(m)
##      [,1] [,2] [,3]
## [1,]    1    0    5
## [2,]    2    0    0
## [3,]    3    0    0
## [4,]    4    0    9
sm <- Matrix(vectors,nrow = 4, sparse = TRUE)
print(sm)
## 4 x 3 sparse Matrix of class "dgCMatrix"
##           
## [1,] 1 . 5
## [2,] 2 . .
## [3,] 3 . .
## [4,] 4 . 9
summary(sm)
## 4 x 3 sparse Matrix of class "dgCMatrix", with 6 entries 
##   i j x
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 1 4
## 5 1 3 5
## 6 4 3 9
sm <- spMatrix(nrow = 10,ncol = 20,i = 1:7,j = 1:7,x = 7*(1:7))
print(sm)
## 10 x 20 sparse Matrix of class "dgTMatrix"
##                                                    
##  [1,] 7  .  .  .  .  .  . . . . . . . . . . . . . .
##  [2,] . 14  .  .  .  .  . . . . . . . . . . . . . .
##  [3,] .  . 21  .  .  .  . . . . . . . . . . . . . .
##  [4,] .  .  . 28  .  .  . . . . . . . . . . . . . .
##  [5,] .  .  .  . 35  .  . . . . . . . . . . . . . .
##  [6,] .  .  .  .  . 42  . . . . . . . . . . . . . .
##  [7,] .  .  .  .  .  . 49 . . . . . . . . . . . . .
##  [8,] .  .  .  .  .  .  . . . . . . . . . . . . . .
##  [9,] .  .  .  .  .  .  . . . . . . . . . . . . . .
## [10,] .  .  .  .  .  .  . . . . . . . . . . . . . .
summary(sm)
## 10 x 20 sparse Matrix of class "dgTMatrix", with 7 entries 
##   i j  x
## 1 1 1  7
## 2 2 2 14
## 3 3 3 21
## 4 4 4 28
## 5 5 5 35
## 6 6 6 42
## 7 7 7 49

4.4 Data Frame

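The data frame is the data structure we meet most often: data read from a csv or txt file is usually stored as a data frame. Like a matrix it is tabular, but a matrix can hold only one data type (numeric and character values cannot coexist in a matrix), whereas real-world tables usually mix types, with characters, dates and numbers in different columns. Data frames are how R reads and represents such data. Note, however, that each column of a data frame must still hold a single data type. If it does not, R coerces the values where it can: a column that contains both text and numbers is converted entirely to text.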
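The coercion rule described above can be checked directly. This is a small sketch with made-up values; the names mixed, df and m are illustrative.

```r
# A column mixing text and numbers is coerced to character
mixed <- c("apples", 52)
df <- data.frame(value = mixed, stringsAsFactors = FALSE)
col_class <- class(df$value)        # "character"

# A matrix holds one type only, so the numbers become strings here too
m <- cbind(name = c("a", "b"), n = c(1, 2))
m_type <- typeof(m)                 # "character"

# A data frame keeps a separate type per column
ok <- data.frame(name = c("a", "b"), n = c(1, 2), stringsAsFactors = FALSE)
n_class <- class(ok$n)              # "numeric"
```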

4.4.1 Data Frames - Creation

nutrition <- data.frame(fruit = c("apples", "pears", "bananas"), calories = c(52, 57, 92))
head(nutrition, 3)
##     fruit calories
## 1  apples       52
## 2   pears       57
## 3 bananas       92
ncol(nutrition)
## [1] 2
nrow(nutrition)
## [1] 3
cbind(nutrition, nutrition)
##     fruit calories   fruit calories
## 1  apples       52  apples       52
## 2   pears       57   pears       57
## 3 bananas       92 bananas       92
rbind(nutrition, nutrition)
##     fruit calories
## 1  apples       52
## 2   pears       57
## 3 bananas       92
## 4  apples       52
## 5   pears       57
## 6 bananas       92

4.4.2 Data Frames - Attributes

names(nutrition)[1] <- "type"
attributes(nutrition)
## $names
## [1] "type"     "calories"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
colnames(nutrition)
## [1] "type"     "calories"

4.4.3 Data Frames - Overview

str(nutrition)
## 'data.frame':    3 obs. of  2 variables:
##  $ type    : Factor w/ 3 levels "apples","bananas",..: 1 3 2
##  $ calories: num  52 57 92
summary(nutrition)
##       type      calories   
##  apples :1   Min.   :52.0  
##  bananas:1   1st Qu.:54.5  
##  pears  :1   Median :57.0  
##              Mean   :67.0  
##              3rd Qu.:74.5  
##              Max.   :92.0

4.4.4 Data Frames - Indexing

nutrition[which(nutrition$type == "apples"), ]
##     type calories
## 1 apples       52
nutrition$calories # returns a vector
## [1] 52 57 92
nutrition[["calories"]] # returns a vector
## [1] 52 57 92
nutrition["calories"] # returns a data frame
##   calories
## 1       52
## 2       57
## 3       92

4.4.5 Data Frames - Functions

calorie <- nutrition[["calories"]]
max(calorie)
## [1] 92

Select the fruits whose calories exceed 80

##      type
## 3 bananas
  • Hint: use the subset function
row.names(nutrition)
## [1] "1" "2" "3"
rownames(nutrition)
## [1] "1" "2" "3"
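The code behind the "calories above 80" output is not shown in these notes. One way to get that result with subset, as the hint suggests, is sketched here; the data frame is rebuilt so the snippet runs on its own, and the names high and same are illustrative.

```r
nutrition <- data.frame(
  type = c("apples", "pears", "bananas"),
  calories = c(52, 57, 92)
)

# Keep rows with calories > 80 and only the `type` column
high <- subset(nutrition, calories > 80, select = type)
high
##      type
## 3 bananas

# The same selection with plain indexing; drop = FALSE keeps a data frame
same <- nutrition[nutrition$calories > 80, "type", drop = FALSE]
```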

4.5 List

4.5.1 Lists - Creation

list(
  c("john", "dave", "linda", "kino"), 
  c(183, 168, 177, 167), 
  c("M", "F", "M"))
## [[1]]
## [1] "john"  "dave"  "linda" "kino" 
## 
## [[2]]
## [1] 183 168 177 167
## 
## [[3]]
## [1] "M" "F" "M"
L <- list(
  name = c("john", "dave", "linda", "kino"), 
  height = c(183, 168, 177, 167), 
  gender = c("M", "F", "M"))

L
## $name
## [1] "john"  "dave"  "linda" "kino" 
## 
## $height
## [1] 183 168 177 167
## 
## $gender
## [1] "M" "F" "M"
str(L)
## List of 3
##  $ name  : chr [1:4] "john" "dave" "linda" "kino"
##  $ height: num [1:4] 183 168 177 167
##  $ gender: chr [1:3] "M" "F" "M"
Doraemon.bag <- list(
  char = "K",
  num = c(2, 2, 6),
  mat = matrix(6, 2, 2),
  df = data.frame(A = 1:3, I = 4:6),
  list = list(D = 1, M = 9)
)

str(Doraemon.bag)
## List of 5
##  $ char: chr "K"
##  $ num : num [1:3] 2 2 6
##  $ mat : num [1:2, 1:2] 6 6 6 6
##  $ df  :'data.frame':    3 obs. of  2 variables:
##   ..$ A: int [1:3] 1 2 3
##   ..$ I: int [1:3] 4 5 6
##  $ list:List of 2
##   ..$ D: num 1
##   ..$ M: num 9

4.5.2 Lists - Indexing

L <- list(
  name = c("john", "dave", "linda", "kino"), 
  height = c(183, 168, 177, 167), 
  gender = c("M", "F", "M"))
L$name
## [1] "john"  "dave"  "linda" "kino"
L[["name"]]
## [1] "john"  "dave"  "linda" "kino"
L["name"]
## $name
## [1] "john"  "dave"  "linda" "kino"
c(is.vector(L$name), is.vector(L[["name"]]),   
  is.list(L["name"]))
## [1] TRUE TRUE TRUE

4.6 Factor

4.6.1 Factors - Creation and Attributes

f <- factor(c("yes", "yes", "no", "no", "yes"))

f
## [1] yes yes no  no  yes
## Levels: no yes
attributes(f)
## $levels
## [1] "no"  "yes"
## 
## $class
## [1] "factor"
f <- factor(
  c("yes", "yes", "no", "no", "yes"),    
  levels = c("yes", "no"))

f
## [1] yes yes no  no  yes
## Levels: yes no
attr(f, "levels") <- c("y", "n")

unclass(f)
## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"

4.6.2 Factors - cut

age <- sample(21:99, size = 50, replace = T)
age
##  [1] 66 91 75 50 90 66 52 85 90 72 35 66 50 68 91 55 74 40 21 64 37 45 53
## [24] 96 26 66 24 57 27 64 40 29 99 60 85 97 32 33 76 53 65 30 63 80 27 54
## [47] 82 30 58 44
age_label <- cut(age, breaks = c(20,45,75,100), labels = c("young","middle","old"))
age_label
##  [1] middle old    middle middle old    middle middle old    old    middle
## [11] young  middle middle middle old    middle middle young  young  middle
## [21] young  young  middle old    young  middle young  middle young  middle
## [31] young  young  old    middle old    old    young  young  old    middle
## [41] middle young  middle old    young  middle old    young  middle young 
## Levels: young middle old

4.7 Date

# Get today's date
today <- Sys.Date()
# ISO 8601 standard format: yyyy-mm-dd
print(today)
## [1] "2018-07-10"
typeof(today)
## [1] "double"
class(today)
## [1] "Date"
# Convert a string to a Date
secondDay <- as.Date("2018-07-10")
print(secondDay)
## [1] "2018-07-10"
# What if the dates in our data are formatted like 07|10|2018?
thirdDay <- c("07|11|2018")
class(thirdDay)
## [1] "character"
print(thirdDay)
## [1] "07|11|2018"
# %Y matches the four-digit year; %y would read only "20" and return 2020-07-11
thirdDay <- as.Date(thirdDay, format = "%m|%d|%Y")
class(thirdDay)
## [1] "Date"
print(thirdDay)
## [1] "2018-07-11"

format code   Description
%b            abbreviated month name (Jan)
%B            full month name
%d            two-digit day of the month
%m            two-digit month
%y            year without century
%Y            year with century
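The same format codes work in both directions: as.Date() uses them to parse a string, and format() uses them to print a Date. A short sketch follows, with illustrative variable names; it sticks to numeric codes because %b and %B produce locale-dependent month names.

```r
# Parse with an explicit format, then print in other layouts
d <- as.Date("07|11|2018", format = "%m|%d|%Y")

iso <- format(d)                  # default ISO 8601 layout: "2018-07-11"
short <- format(d, "%d/%m/%y")    # two-digit year: "11/07/18"
year_only <- format(d, "%Y")      # "2018", returned as a character string
```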
today <- ISOdate(year = 2018,month = 07,day = 10)
print(today)
## [1] "2018-07-10 12:00:00 GMT"
class(today)
## [1] "POSIXct" "POSIXt"
today <- as.Date(today)
class(today)
## [1] "Date"
y <- 2015:2018
m <- 5:8
d <- 5:8
ISOdate(y,m,d)
## [1] "2015-05-05 12:00:00 GMT" "2016-06-06 12:00:00 GMT"
## [3] "2017-07-07 12:00:00 GMT" "2018-08-08 12:00:00 GMT"
# ISOdatetime extends this to hours, minutes and seconds
# Time is relative: R counts dates from the origin 1970-01-01
as.integer(today)
## [1] 17722
julian(today)
## [1] 17722
## attr(,"origin")
## [1] "1970-01-01"
as.integer(as.Date("1970-01-02"))
## [1] 1
# To extract part of a date, convert the Date object to a POSIXlt object, which is a list of the date's components.

today_list <- as.POSIXlt(today)
print(today_list)
## [1] "2018-07-10 UTC"
class(today_list)
## [1] "POSIXlt" "POSIXt"
today_list$mday
## [1] 10
today_list$mon # January = 0
## [1] 6
today_list$year
## [1] 118
today_list$year + 1900
## [1] 2018
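POSIXlt component   Description
sec                 seconds
min                 minutes
hour                hours
mday                day of the month
mon                 month (0-11, January = 0)
year                years since 1900
wday                day of the week (0-6, Sunday = 0)
yday                day of the year (0-365)
isdst               daylight saving time flag
# Create a sequence of dates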
f <- as.Date("2018-07-01")
t <- as.Date("2018-07-10")
seq(from = f, to = t, by = 1)
##  [1] "2018-07-01" "2018-07-02" "2018-07-03" "2018-07-04" "2018-07-05"
##  [6] "2018-07-06" "2018-07-07" "2018-07-08" "2018-07-09" "2018-07-10"
# Date arithmetic
# Differences
# Number of days between two dates
begin <- as.Date("2016-03-04")
end <- as.Date("2016-05-08")
(during <- end - begin)
## Time difference of 65 days
# Number of weeks and hours between two dates
difftime(end,begin,units = "weeks")
## Time difference of 9.285714 weeks
difftime(end,begin,units = "hours")
## Time difference of 1560 hours

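Everything above comes with base R. Next we introduce lubridate, a package dedicated to handling date-time data, which is widely praised by users for speeding up data preprocessing. lubridate offers two main families of functions, one for points in time and one for time spans. The functions are both powerful and intuitive: converting a string to a date-time does not require a matching format string, and extracting a component of a date takes a single small function call with no extra arguments.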

## [1] "2009-01-01" "2009-01-02" "2009-01-03" "2009-01-04" "2009-01-05"
## [6] "2009-01-06" "2009-01-07"
## [1] 20
## [1] 6
## [1] 1
## [1] 30
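The lubridate code that produced the output above is not included in these notes. As an independent and minimal sketch of the two points made in the paragraph (parsing without a format string, and one-call component extraction), something like the following should work, assuming the lubridate package is installed. The dates and variable names are illustrative.

```r
# Runs only when lubridate is available; install.packages("lubridate") otherwise
if (requireNamespace("lubridate", quietly = TRUE)) {
  d1 <- lubridate::ymd("2018-07-10")    # year-month-day, no format string
  d2 <- lubridate::mdy("07/11/2018")    # month-day-year, no format string

  # Each component is one function call with no extra arguments
  parts <- c(
    year  = lubridate::year(d1),
    month = lubridate::month(d1),
    day   = lubridate::mday(d1)
  )
  print(parts)
}
```

The parser is chosen by the order of the components (ymd, mdy, dmy and so on) rather than by a format string such as "%m|%d|%Y".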

4.8 Missing Value

  • NaN
  • NA
q <- c(1, NaN, NA, 4)
class(q)
## [1] "numeric"
is.nan(q)
## [1] FALSE  TRUE FALSE FALSE
is.na(q)
## [1] FALSE  TRUE  TRUE FALSE
mean(q)
## [1] NaN
mean(q, na.rm = T)
## [1] 2.5
is.na(NA)
is.na(NaN)
is.nan(NaN)
is.na(NA)
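The calls above are listed without their results. The sketch below collects the four possible combinations in one named vector; the name checks is illustrative.

```r
# NaN counts as missing for is.na(), but NA is not "not a number" for is.nan()
checks <- c(
  na_is_na   = is.na(NA),
  nan_is_na  = is.na(NaN),
  nan_is_nan = is.nan(NaN),
  na_is_nan  = is.nan(NA)
)
checks
##   na_is_na  nan_is_na nan_is_nan  na_is_nan 
##       TRUE       TRUE       TRUE      FALSE

# NaN typically arises from undefined arithmetic such as 0/0
zero_div <- is.nan(0 / 0)   # TRUE
```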

5 Reading and Writing Data in R

  • Reading csv and txt files
data <- read.csv("./data/data.csv", header = TRUE) 

data <- read.table("./data/data.txt", header = TRUE) 
  • Writing csv and txt files
write.table(data, file = "./data/dataout.txt")
write.csv(data, file = "./data/dataout.csv")
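The paths above depend on files that are not shipped with these notes. The following self-contained sketch does a write-then-read round trip through a temporary file instead; the data frame df and its columns are made up for illustration.

```r
# Made-up data written to a temporary file, then read back
df <- data.frame(id = 1:3, score = c(90.5, 80, 70))
path <- tempfile(fileext = ".csv")

write.csv(df, file = path, row.names = FALSE)  # row.names = FALSE avoids an extra index column
back <- read.csv(path, header = TRUE)
unlink(path)                                   # remove the temporary file

round_trip_ok <- isTRUE(all.equal(df, back))
```

Without row.names = FALSE, write.csv adds the row names as a first column, which read.csv then reads back as an extra variable.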

6 Control Flow in R

  • if, else
  • for and something new
  • while
  • repeat
  • break
  • next

6.1 if, else

s <- 1:5
ifelse(s > 3, yes = "Good", no = "Fair")
## [1] "Fair" "Fair" "Fair" "Good" "Good"
a <- 1

if(a > 2){
  b <- 1
}else{
  b <- 0
}
b
## [1] 0
a <- -1

if(a > 2){
  b <- 1
}else if(a < 0){
  b <- 1
}else{
  b <- 0
}
b
## [1] 1

6.2 for

for(i in 1:3){
  print(i) 
}
## [1] 1
## [1] 2
## [1] 3
M <- matrix(1:2, 2, 3)
M
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
  • Column sums
numeric(3)
## [1] 0 0 0
col.sum <- numeric(ncol(M))
for(j in 1:ncol(M)){
  for(i in 1:nrow(M)){
    col.sum[j] <- col.sum[j] + M[i, j]
  }
}
col.sum
## [1] 3 3 3
  • Column sums
M
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
  • Vectorized computation
col.sum
## [1] 3 3 3
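The vectorized code that produced the result above is not shown in these notes. One plausible version, offered as a sketch rather than as the original, uses colSums and compares it with a single loop; loop_sum and vec_sum are illustrative names.

```r
M <- matrix(1:2, 2, 3)

# Loop version: one pass per column
loop_sum <- numeric(ncol(M))
for (j in seq_len(ncol(M))) {
  loop_sum[j] <- sum(M[, j])
}

# Vectorized version: a single call with no explicit loop
vec_sum <- colSums(M)
vec_sum
## [1] 3 3 3
```

seq_len(ncol(M)) is used instead of 1:ncol(M) so that the loop body is skipped when the matrix has zero columns.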

6.3 while

i <- 0
while(i < 3){
  print(i)
  i <- i + 1
}
## [1] 0
## [1] 1
## [1] 2

6.4 repeat and break

i <- 0
repeat{
  print(i)
  if (i > 1) break else i <- i + 1
}
## [1] 0
## [1] 1
## [1] 2
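The list at the start of this chapter mentions next, but no example of it is given. A minimal sketch follows: next skips the rest of the current iteration, while break leaves the loop entirely. The variable odd is illustrative.

```r
odd <- c()
for (i in 1:6) {
  if (i %% 2 == 0) next   # skip even numbers and continue with the next i
  odd <- c(odd, i)
}
odd
## [1] 1 3 5
```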

6.6 The apply Family

print(M)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
apply(M, 2, sum)
## [1] 3 3 3
apply(M, 1, sum)
## [1] 3 6

6.6.1 lapply and sapply

L <- list(
  a = matrix(1:3, 1), 
  b = data.frame(
    char = c("A", "B"), 
    logic = c(T, F)), 
  c = list(
    1:3, 
    "T",
    matrix(1:6, 2, 3)))
L 
## $a
##      [,1] [,2] [,3]
## [1,]    1    2    3
## 
## $b
##   char logic
## 1    A  TRUE
## 2    B FALSE
## 
## $c
## $c[[1]]
## [1] 1 2 3
## 
## $c[[2]]
## [1] "T"
## 
## $c[[3]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
lapply(L, class)
## $a
## [1] "matrix"
## 
## $b
## [1] "data.frame"
## 
## $c
## [1] "list"
sapply(L, class)
##            a            b            c 
##     "matrix" "data.frame"       "list"

6.6.2 mapply

mapply(rep, 1:6, 6:1)
## [[1]]
## [1] 1 1 1 1 1 1
## 
## [[2]]
## [1] 2 2 2 2 2
## 
## [[3]]
## [1] 3 3 3 3
## 
## [[4]]
## [1] 4 4 4
## 
## [[5]]
## [1] 5 5
## 
## [[6]]
## [1] 6

6.6.3 tapply

set.seed(6)
n <- 9
bike <- data.frame(
  orgn = sample(c("SE", "S", "N"), n, replace = T),
  dest = sample(c("SE", "S", "N"), n, replace = T),
  flux = round(rnorm(n, 18))) 

bike
##   orgn dest flux
## 1    S   SE   17
## 2    N    S   20
## 3   SE    N   17
## 4    S   SE   19
## 5    N   SE   18
## 6    N    N   17
## 7    N   SE   18
## 8    N    S   20
## 9    S    N   17
tapply(X = bike$flux, INDEX = bike$orgn, FUN = sum)
##  N  S SE 
## 93 53 17
##     
##      N S SE
##   N  1 2  2
##   S  1 0  2
##   SE 1 0  0
##     dest
## orgn  N  S SE
##   N  17 20 18
##   S  17 NA 18
##   SE 17 NA NA
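The last two tables above are shown without their code. They look like a cross-tabulation of counts and a tapply over two factors, which can be sketched as follows. The trips data frame is made up so that the results do not depend on the random seed, and mean is an assumed choice of function.

```r
# Made-up data: origin, destination and flow for five trips
trips <- data.frame(
  orgn = c("N", "N", "S", "S", "N"),
  dest = c("S", "S", "N", "S", "N"),
  flux = c(20, 18, 17, 19, 16)
)

# Count the trips for each origin/destination pair
counts <- table(trips$orgn, trips$dest)

# Passing a list of factors as INDEX groups by every combination
means <- tapply(trips$flux, list(orgn = trips$orgn, dest = trips$dest), mean)
means["N", "S"]   # mean of 20 and 18
## [1] 19
```

Combinations that never occur in the data appear as NA in the tapply result and as 0 in the table of counts.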

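6.6.4 Summary: the apply family

Command                 Description
apply(X, MARGIN, FUN)   applies FUN over the given MARGIN of an array or matrix X and returns a vector, array or list
lapply(X, FUN)          applies FUN to each element of a list X and returns a list
sapply(X, FUN)          a simplifying version of lapply that returns a vector or array instead of a list where possible
mapply(FUN, ...)        a multivariate version of lapply
tapply(X, INDEX, FUN)   applies FUN to each combination of the factors given by INDEX and returns a table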

7 Functions in R

  • Definition
  • Parameters
  • Encapsulation

7.1 Definition

rsummary <- function(x) {
  return(
    list(
    mean = mean(x), 
    sd = sd(x),
    quantiles = quantile(x)))
}
x <- rnorm(100, mean = 6, sd = 1)
rsummary(x)
## $mean
## [1] 5.928524
## 
## $sd
## [1] 0.990941
## 
## $quantiles
##       0%      25%      50%      75%     100% 
## 4.047651 5.268584 5.818003 6.522913 8.608098
calGrade <- function(x, base = 59) {
  s <- x + base
  if(s >= 90) {
    grade <- "A"
  } else if(s >= 80) {
    grade <- "B"
  } else if(s >= 70) {
    grade <- "C"
  } else if(s >= 60){
    grade <- "D"
  }else{
    grade <- "Failed"
  }
  grade
}

scores <- c(40, 20, 27, 31, 0, 0)

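# calGrade is not vectorized: `if` looks only at the first element of s
# (a warning in older versions of R, an error since R 4.2.0)
calGrade(scores)
## [1] "A"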
lapply(scores, calGrade)
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] "C"
## 
## [[3]]
## [1] "B"
## 
## [[4]]
## [1] "A"
## 
## [[5]]
## [1] "Failed"
## 
## [[6]]
## [1] "Failed"
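Because if handles only one value at a time, the notes apply calGrade element by element with lapply. An alternative is to make the grading itself vectorized. The sketch below uses cut for this; cal_grade_vec is a hypothetical name, and it assumes the same thresholds and default base as calGrade.

```r
# cal_grade_vec is a hypothetical vectorized counterpart of calGrade
cal_grade_vec <- function(x, base = 59) {
  s <- x + base
  cut(s,
      breaks = c(-Inf, 60, 70, 80, 90, Inf),
      labels = c("Failed", "D", "C", "B", "A"),
      right = FALSE)   # left-closed intervals: [60, 70), [70, 80), ...
}

scores <- c(40, 20, 27, 31, 0, 0)
grades <- as.character(cal_grade_vec(scores))
grades
## [1] "A"      "C"      "B"      "A"      "Failed" "Failed"
```

cut returns a factor, so as.character is used here to get plain strings comparable to the lapply output.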
  • return
square <- function(x) {
  return(x^2)
}

square(11)
## [1] 121
square <- function(x) {
  x^2
}

square(11)
## [1] 121

7.2 Parameter

  • formals
formals(calGrade)
## $x
## 
## 
## $base
## [1] 59
formals(plot)
## $x
## 
## 
## $y
## 
## 
## $...
  • Arguments may be missing and may have default values

  • Argument matching

(M <- matrix(1:6, nrow = 2)) # exact matching
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
(M <- matrix(1:6, nr = 2)) # partial matching
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
(M <- matrix(1:6, 2)) # positional matching
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

quantile(x, probs = seq(0, 1, 0.25), ...)
apply(X, MARGIN, FUN, ...)

set.seed(1)
(x <- matrix(rnorm(12), 3)) 
##            [,1]       [,2]      [,3]       [,4]
## [1,] -0.6264538  1.5952808 0.4874291 -0.3053884
## [2,]  0.1836433  0.3295078 0.7383247  1.5117812
## [3,] -0.8356286 -0.8204684 0.5757814  0.3898432
apply(x, 2, quantile)  # matrix
##            [,1]       [,2]      [,3]        [,4]
## 0%   -0.8356286 -0.8204684 0.4874291 -0.30538839
## 25%  -0.7310412 -0.2454803 0.5316052  0.04222742
## 50%  -0.6264538  0.3295078 0.5757814  0.38984324
## 75%  -0.2214052  0.9623943 0.6570530  0.95081220
## 100%  0.1836433  1.5952808 0.7383247  1.51178117

apply(x, 2, quantile, probs = c(.25, .75))
##           [,1]       [,2]      [,3]       [,4]
## 25% -0.7310412 -0.2454803 0.5316052 0.04222742
## 75% -0.2214052  0.9623943 0.6570530 0.95081220
str(paste)
## function (..., sep = " ", collapse = NULL)
  • GOOD
paste("Hi", " There", sep = "!")
## [1] "Hi! There"
  • BAD
paste("Hi", " There", se = "!")
## [1] "Hi  There !"
x <- 1:30
plot(x, log(x))

  • User-defined function
my.plot <- 
  function(x, y, type = "l", ...){
     plot(x, y, type = type, ...)
}
my.plot(x, log(x))

7.3 Encapsulation

  • Obsessive Compulsive Disorder, OCD

  • The main file calls a file of function definitions
source("./fn/MyPlotEncap.R")  
my.plot.encap(x, log(x))
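The file ./fn/MyPlotEncap.R is not included in these notes, so the call above cannot be run as is. The same pattern can be sketched with a temporary file; the function square_plus and its contents are hypothetical and stand in for whatever the real function file defines.

```r
# Write a small function file (hypothetical content) to a temporary location
fn_file <- tempfile(fileext = ".R")
writeLines("square_plus <- function(x, k = 1) x^2 + k", fn_file)

# The main script loads the definitions, then uses them
source(fn_file)
result <- square_plus(3)   # 3^2 + 1
result
## [1] 10

unlink(fn_file)            # clean up the temporary file
```

Keeping function definitions in their own files lets the main script stay short and lets several scripts share the same helpers.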

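8 Homework

Part 1: swirl is a package designed for practicing R interactively; you can follow its prompts to learn R step by step. Some of the homework for this course is completed with this package. See the package's introduction and its code repository for details. Follow the steps below to complete this assignment.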

knitr::include_graphics('fig/swirl.png')

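First, install the swirl package:

install.packages("swirl")

Next, start swirl:

# Load swirl
library("swirl")

# Start swirl
swirl()

The following screen appears:

After you enter swirl(), this is followed by:

Enter your name, and you will then reach the course selection menu. For the day-two homework, complete lessons 1-5.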

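Part 2:

  • Watch the introductory R videos in the 狗熊会 (Clubear) video micro-courses to learn how to import data into R.

  • Download the Speed Dating data, together with its data description and the accompanying article

  • Create an R Markdown document

  • Read in the Speed Dating data

  • Convert gender to a factor

  • Compute the male-to-female ratio in the experiment

  • Compute the mean of men's pre-date ratings for each attribute they look for in a partner (attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, shar1_1)

  • Compute the mean of women's pre-date ratings for each attribute they look for in a partner (attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, shar1_1)

  • What conclusions can you draw?

  • What other questions could be studied?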