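1 Basic Operations in R

The subject of data science is, naturally, data. In the era of big data, anything a computer can record counts as data. Because data are numeric in nature, basic arithmetic can be applied to them. Arithmetic is the most basic building block of R, so at this stage you can think of R as a calculator.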

1.1 Basic Operation

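Command	Description
`+, -, *, /`	Addition, subtraction, multiplication, division
`^`	Exponentiation
`%%`	Remainder (modulo)
`round()`	Round to the nearest value (halves go to the even digit, so `round(0.5)` is 0)
`floor()`	Round down
`ceiling()`	Round up
`abs()`	Absolute value
`sqrt()`	Square root
`log()`	Logarithm (natural by default)
`exp()`	Exponential
`( )`	Override operator precedence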

1.2 Logical Operation

Command	Description
`&, |, !`	AND, OR, NOT
`==`	Test for equality
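The logical operators in the table work element by element on vectors. A minimal sketch, in which the vectors `x` and `y` are made up purely for illustration:

```r
# Two illustrative logical vectors (not from the course data)
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

both    <- x & y    # element-wise AND: TRUE FALSE FALSE
either  <- x | y    # element-wise OR:  TRUE TRUE TRUE
negated <- !x       # element-wise NOT: FALSE TRUE FALSE

# == also compares element by element
equal <- c(1, 2, 3) == c(1, 5, 3)   # TRUE FALSE TRUE
```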

1.3 Assignment Operation

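Command	Description
`=`	Assignment, right to left only
`<-`	Assignment; also available left to right as `->`
`<<-`	Assignment to a variable in an enclosing (parent) environment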

1.4 Examples

22 %% 6

## [1] 4

floor(22/6)

## [1] 3

sqrt(22/6)

## [1] 1.914854

log(22/6)

## [1] 1.299283

3>4

## [1] FALSE

F == T

## [1] FALSE

Exercise: try to compute 24 using four 1s.

<- assigns in either direction

asg <- 2

asg

## [1] 2

3 -> asg

asg

## [1] 3

= assigns in one direction only

asg = 1

asg

## [1] 1

# This is invalid: a value cannot be the target of an assignment
1 = asg

= passes function arguments

Usage
matrix(data = NA, nrow = 1, ncol = 1, ...)

matrix(1:6, 3)

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

How do we create a 2 × 3 matrix?

GOOD

matrix(1:6, ncol = 3)

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

BAD

matrix(1:6, ncol <- 3)

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

BAD: passing function arguments with <-

ls() # What does this print?

mean(x = 1:3)

## [1] 2

ls()

## [1] "asg"  "ncol"

mean(x <- 1:3)

## [1] 2

ls()

## [1] "asg"  "ncol" "x"

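> Outside of function calls, = can always be replaced by <-, but the reverse does not hold. For example, system.time(A <- matrix(1:6, 3)) must use <-, because = inside a call is read as naming an argument.

<<- assigns to a variable in an enclosing environment (if no such variable exists, it is created in the global environment).
Recommendation: use <- rather than = for assignment.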
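To see what `<<-` does in practice, here is a minimal sketch of a counter built with a closure. The names `make_counter`, `count`, and `counter` are hypothetical and chosen only for this illustration.

```r
# make_counter() returns a function that remembers how often it was called
make_counter <- function() {
  count <- 0              # lives in the environment of make_counter()
  function() {
    count <<- count + 1   # updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
first  <- counter()   # 1
second <- counter()   # 2
```

With `<-` instead of `<<-`, each call would create a fresh local `count` and the counter would return 1 every time.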

2 Code Style in R

Read the 狗熊会 (Clubear) talent-program post on code style.
Google's R Style Guide

3 Data Types in R

3.1 Basic

Atomic types

Logical: logical
Character: character
Integer: integer
Real: numeric (double)
Complex: complex
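As a quick check of the five atomic types listed above, the following sketch asks `typeof()` for one value of each kind. Note that a plain `7` is stored as a double; the `L` suffix is what makes an integer.

```r
# One example value per atomic type
values <- list(TRUE, "K", 7L, 7, 7 + 0i)
types <- sapply(values, typeof)
# "logical" "character" "integer" "double" "complex"

int_check <- c(is.integer(7L), is.integer(7))   # TRUE FALSE
```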

3.2 Examples

is.logical(T)

## [1] TRUE

is.logical(1)

## [1] FALSE

typeof(7)

## [1] "double"

is.numeric(7)

## [1] TRUE

is.integer(7)

## [1] FALSE

is.character("K")

## [1] TRUE

is.character(as.character("1"))

## [1] TRUE

4 Data Structures in R

4.1 Basic

Structured objects (built from the atomic types above)

Vectors
Matrices
Data frames
Lists
Factors
Dates (class Date)

4.2 Vector

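What is a vector? A vector is a one-dimensional array that stores data of a single type, and it is the most basic of all the data structures. Picture a house with many small rooms, each holding an element of the same type. The following subsections cover basic vector operations and common vector types.

4.2.1 Vectors - Creation

In general, the function c() creates a vector: just list the elements inside the parentheses. Typing every element is tedious, though, and the shortcut lies in patterns.

If the elements follow a known pattern, there is probably a function built for it. seq() creates arithmetic sequences; a:b creates the consecutive integers from a to b; sample() draws elements at random from a vector; and paste() glues strings and numbers together in a regular way (for example, to generate variable names in bulk). Some examples: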

vector(mode = "character", length = 3)

## [1] "" "" ""

v <- c(0, 1, -2)

typeof(v)

## [1] "double"

class(v)

## [1] "numeric"

length(v)

## [1] 3

a <- 1:10
a

##  [1]  1  2  3  4  5  6  7  8  9 10

b <- seq(1,10, by = 2)
b

## [1] 1 3 5 7 9

c <- c('a', 'b', 'c')
c

## [1] "a" "b" "c"

d <- sample(a, size = 20, replace = T)
d

##  [1]  8  8  3  2  1  8  8  1  7  4  7  2  4  3  5  5  6 10  8  6

e <- paste("x_", 1:5, collapse = ",")
e

## [1] "x_ 1,x_ 2,x_ 3,x_ 4,x_ 5"

4.2.2 Vectors - Type Conversion

as.character(v)

## [1] "0"  "1"  "-2"

as.logical(v)

## [1] FALSE  TRUE  TRUE

class(v)

## [1] "numeric"

class(as.logical(v))

## [1] "logical"

4.2.3 Vectors - Indexing

v[2]

## [1] 1

v[c(2, 3)]

## [1]  1 -2

v[2:3]

## [1]  1 -2

v[-2]

## [1]  0 -2

which(v==1)

## [1] 2

v[which(v < 1)]

## [1]  0 -2

4.2.4 Vectors - Naming Elements

names(v)

## NULL

names(v) <- c("Davide", "Linda", "Smale")
names(v)

## [1] "Davide" "Linda"  "Smale"

v[c("Davide", "Linda")]

## Davide  Linda 
##      0      1

4.2.5 Vectors - Arithmetic

t <- c(2, 1, 3)

names(v) <- c("Davide", "Linda", "Smale")
(v + t)/2

## Davide  Linda  Smale 
##    1.0    1.0    0.5

t > v

## Davide  Linda  Smale 
##   TRUE  FALSE   TRUE

summary(v)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0000 -1.0000  0.0000 -0.3333  0.5000  1.0000

4.2.6 Vectors - Functions

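Command	Description
`sum(v)`	Sum of all elements of `v`
`mean(v)`	Mean of `v`
`median(v)`	Median of `v`
`min(v), max(v)`	Minimum and maximum of `v`
`sd(v), var(v)`	Standard deviation and variance of `v`
`length(v)`	Number of elements in `v`
`pmax(v1, v2), pmin(v1, v2)`	Element-wise maximum and minimum; e.g. `pmax(quiz1, quiz2)` returns each student's higher score across two quizzes
`sort(v)`	Sort `v`
`order(v)`	Indices that would sort `v`
`unique(v)`	Remove duplicates from `v`
`summary(v)`	Five-number summary plus the mean
`any(v)`	Whether at least one element of `v` is TRUE
`all(v)`	Whether all elements of `v` are TRUE
`intersect`	Set intersection
`union`	Set union
`setdiff`	Set difference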
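A few of the functions in the table are easier to grasp with numbers. The sketch below uses two made-up quiz score vectors, `quiz1` and `quiz2`, which are assumptions for illustration only.

```r
# Hypothetical scores of three students on two quizzes
quiz1 <- c(80, 65, 90)
quiz2 <- c(75, 70, 95)

best    <- pmax(quiz1, quiz2)   # higher score per student: 80 70 95
ranking <- order(quiz1)         # positions that sort quiz1: 2 1 3
sorted  <- quiz1[ranking]       # same as sort(quiz1): 65 80 90

any_fail <- any(best < 60)      # FALSE: nobody is below 60
all_pass <- all(best >= 60)     # TRUE: everybody reaches 60

# Set operations treat the vectors as sets of values
common     <- intersect(c(1, 2, 3), c(2, 3, 4))   # 2 3
only_first <- setdiff(c(1, 2, 3), c(2, 3, 4))     # 1
```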

f <- c(1,2,3,2,1,1,1,2,3,4,5,1,3,1,2,3,1)
x <- letters[f]
x

##  [1] "a" "b" "c" "b" "a" "a" "a" "b" "c" "d" "e" "a" "c" "a" "b" "c" "a"

match(x, letters[1:5])

##  [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1

match(x, letters[1:7])

##  [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1

match(x, letters[1:3])

##  [1]  1  2  3  2  1  1  1  2  3 NA NA  1  3  1  2  3  1

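4.2.7 Vectors - Character Vectors

Character data differ from numeric data and need their own dedicated functions. Even a single string has properties of its own: we rarely ask for the length of a number, but we often need the length of a string, for example: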

length("欢迎关注狗熊会")

## [1] 1

nchar("欢迎关注狗熊会")

## [1] 7

nchar("Welcome to follow the Clubear")

## [1] 29

Once we know the length, we can slice a string and extract a substring with substr(). The usage is substr(x, start, stop), for example:

substr("Welcome to follow the Clubear", 3,9)

## [1] "lcome t"

paste(1:4)

## [1] "1" "2" "3" "4"

paste(1:4, collapse = "")

## [1] "1234"

paste(1:4, sep = "")

## [1] "1" "2" "3" "4"

paste(1:4, collapse = ",")

## [1] "1,2,3,4"

paste(1:4, sep = ",")

## [1] "1" "2" "3" "4"

paste(letters[1:4], 1:4)

## [1] "a 1" "b 2" "c 3" "d 4"

paste(letters[1:4], 1:4, collapse = "_")

## [1] "a 1_b 2_c 3_d 4"

paste(letters[1:4], 1:4, sep = "_")

## [1] "a_1" "b_2" "c_3" "d_4"

paste(letters[1:4], 1:4, collapse = "_",sep = "|")

## [1] "a|1_b|2_c|3_d|4"

t <- c("狗熊会", "Clubear", "Social Science")
grep("Social", t)

## [1] 3

# Compute the mean salary
salary <- c("10w", "15w", "23w", "28w")

mean(salary)

## Warning in mean.default(salary): argument is not numeric or logical:
## returning NA

## [1] NA

Hint: use the gsub function.
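One possible solution is sketched below. It assumes the trailing "w" is only a unit suffix that can be stripped before converting to numbers; the names `salary_num` and `mean_salary` are introduced here for illustration.

```r
salary <- c("10w", "15w", "23w", "28w")

# Remove the unit suffix, then convert the remaining digits to numbers
salary_num <- as.numeric(gsub("w", "", salary))   # 10 15 23 28

mean_salary <- mean(salary_num)                   # 19 (still in units of "w")
```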

4.3 Matrix

4.3.1 Matrices - Creation

M <- matrix(1:9, nrow = 3, ncol = 3)
print(M)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

length(M)

## [1] 9

dim(M)

## [1] 3 3

cbind(1:3, 4:6)

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

rbind(1:3, 4:6)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

attributes(M)

## $dim
## [1] 3 3

I <- diag(1:4)
I

##      [,1] [,2] [,3] [,4]
## [1,]    1    0    0    0
## [2,]    0    2    0    0
## [3,]    0    0    3    0
## [4,]    0    0    0    4

vec <- 1:9
vec

## [1] 1 2 3 4 5 6 7 8 9

attributes(vec)

## NULL

dim(vec) <- c(3, 3)

vec

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

4.3.2 Matrices - Indexing

M[2,3]

## [1] 8

M[1, ]

## [1] 1 4 7

M[, 2]

## [1] 4 5 6

4.3.3 Matrices - Arithmetic

Q <- matrix(rep(10, 9), 3, 3)

M * Q

##      [,1] [,2] [,3]
## [1,]   10   40   70
## [2,]   20   50   80
## [3,]   30   60   90

M %*% Q

##      [,1] [,2] [,3]
## [1,]  120  120  120
## [2,]  150  150  150
## [3,]  180  180  180

4.3.4 Matrices - Functions

colMeans(M)

## [1] 2 5 8

colSums(M)

## [1]  6 15 24

colnames(M)

## NULL

rownames(M)

## NULL

4.3.5 Sparse Matrices

# install.packages("Matrix")
library(Matrix)
vectors <- c(1:4, rep(0,4), 5,0,0,9)
m <- matrix(vectors, nrow = 4)
print(m)

##      [,1] [,2] [,3]
## [1,]    1    0    5
## [2,]    2    0    0
## [3,]    3    0    0
## [4,]    4    0    9

sm <- Matrix(vectors,nrow = 4, sparse = TRUE)
print(sm)

## 4 x 3 sparse Matrix of class "dgCMatrix"
##           
## [1,] 1 . 5
## [2,] 2 . .
## [3,] 3 . .
## [4,] 4 . 9

summary(sm)

## 4 x 3 sparse Matrix of class "dgCMatrix", with 6 entries 
##   i j x
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 1 4
## 5 1 3 5
## 6 4 3 9

sm <- spMatrix(nrow = 10,ncol = 20,i = 1:7,j = 1:7,x = 7*(1:7))
print(sm)

## 10 x 20 sparse Matrix of class "dgTMatrix"
##                                                    
##  [1,] 7  .  .  .  .  .  . . . . . . . . . . . . . .
##  [2,] . 14  .  .  .  .  . . . . . . . . . . . . . .
##  [3,] .  . 21  .  .  .  . . . . . . . . . . . . . .
##  [4,] .  .  . 28  .  .  . . . . . . . . . . . . . .
##  [5,] .  .  .  . 35  .  . . . . . . . . . . . . . .
##  [6,] .  .  .  .  . 42  . . . . . . . . . . . . . .
##  [7,] .  .  .  .  .  . 49 . . . . . . . . . . . . .
##  [8,] .  .  .  .  .  .  . . . . . . . . . . . . . .
##  [9,] .  .  .  .  .  .  . . . . . . . . . . . . . .
## [10,] .  .  .  .  .  .  . . . . . . . . . . . . . .

summary(sm)

## 10 x 20 sparse Matrix of class "dgTMatrix", with 7 entries 
##   i j  x
## 1 1 1  7
## 2 2 2 14
## 3 3 3 21
## 4 4 4 28
## 5 5 5 35
## 6 6 6 42
## 7 7 7 49

4.4 Data Frame

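The data frame is the data structure we meet most often: data read from a csv or txt file are stored as a data frame automatically. Like a matrix it is tabular, but a matrix can hold only one data type (numeric and character data cannot coexist in a matrix), whereas real-world tables usually mix types, with character, date, and numeric values in different columns. The data frame is how R reads and represents such data. Note that each column of a data frame must still hold a single data type. If it does not, R coerces the values within limits: a column containing both text and numbers is converted entirely to text.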
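The coercion rule can be checked directly. In this sketch the vector `mixed` and the data frame `foods` are hypothetical examples: columns may differ in type from one another, but a column that mixes text and numbers ends up as text.

```r
# A vector mixing numbers and text is coerced to character
mixed <- c(52, "unknown", 92)
mixed_class <- class(mixed)   # "character"

# Different columns may have different types; each column has one type
foods <- data.frame(
  fruit    = c("apples", "pears", "bananas"),
  calories = c(52, 57, 92),
  note     = mixed,
  stringsAsFactors = FALSE)

column_classes <- sapply(foods, class)
# fruit: "character", calories: "numeric", note: "character"
```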

4.4.1 Data Frames - Creation

nutrition <- data.frame(fruit = c("apples", "pears", "bananas"), calories = c(52, 57, 92), stringsAsFactors = TRUE)

head(nutrition, 3)

##     fruit calories
## 1  apples       52
## 2   pears       57
## 3 bananas       92

ncol(nutrition)

## [1] 2

nrow(nutrition)

## [1] 3

cbind(nutrition, nutrition)

##     fruit calories   fruit calories
## 1  apples       52  apples       52
## 2   pears       57   pears       57
## 3 bananas       92 bananas       92

rbind(nutrition, nutrition)

##     fruit calories
## 1  apples       52
## 2   pears       57
## 3 bananas       92
## 4  apples       52
## 5   pears       57
## 6 bananas       92

4.4.2 Data Frames - Attributes

names(nutrition)[1] <- "type"

attributes(nutrition)

## $names
## [1] "type"     "calories"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"

colnames(nutrition)

## [1] "type"     "calories"

4.4.3 Data Frames - Overview

str(nutrition)

## 'data.frame':    3 obs. of  2 variables:
##  $ type    : Factor w/ 3 levels "apples","bananas",..: 1 3 2
##  $ calories: num  52 57 92

summary(nutrition)

##       type      calories   
##  apples :1   Min.   :52.0  
##  bananas:1   1st Qu.:54.5  
##  pears  :1   Median :57.0  
##              Mean   :67.0  
##              3rd Qu.:74.5  
##              Max.   :92.0

4.4.4 Data Frames - Indexing

nutrition[which(nutrition$type == "apples"), ]

##     type calories
## 1 apples       52

nutrition$calories # returns a vector

## [1] 52 57 92

nutrition[["calories"]] # returns a vector

## [1] 52 57 92

nutrition["calories"] # returns a data frame

##   calories
## 1       52
## 2       57
## 3       92

4.4.5 Data Frames - Functions

calorie <- nutrition[["calories"]]

max(calorie)

## [1] 92

Select the fruits with more than 80 calories:

##      type
## 3 bananas

Hint: use the subset function.
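One way to follow the hint is sketched below. The data frame is rebuilt here so that the snippet runs on its own, and `high_cal` is a name introduced for illustration.

```r
nutrition <- data.frame(
  type     = c("apples", "pears", "bananas"),
  calories = c(52, 57, 92))

# Keep rows with calories above 80 and only the type column
high_cal <- subset(nutrition, calories > 80, select = type)
high_cal
```

The row keeps its original row name, 3, which is why the printed result is labelled 3 rather than 1.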

row.names(nutrition)

## [1] "1" "2" "3"

rownames(nutrition)

## [1] "1" "2" "3"

4.5 List

4.5.1 Lists - Creation

list(
  c("john", "dave", "linda", "kino"), 
  c(183, 168, 177, 167), 
  c("M", "F", "M"))

## [[1]]
## [1] "john"  "dave"  "linda" "kino" 
## 
## [[2]]
## [1] 183 168 177 167
## 
## [[3]]
## [1] "M" "F" "M"

L <- list(
  name = c("john", "dave", "linda", "kino"), 
  height = c(183, 168, 177, 167), 
  gender = c("M", "F", "M"))

L

## $name
## [1] "john"  "dave"  "linda" "kino" 
## 
## $height
## [1] 183 168 177 167
## 
## $gender
## [1] "M" "F" "M"

str(L)

## List of 3
##  $ name  : chr [1:4] "john" "dave" "linda" "kino"
##  $ height: num [1:4] 183 168 177 167
##  $ gender: chr [1:3] "M" "F" "M"

Doraemon.bag <- list(
  char = "K",
  num = c(2, 2, 6),
  mat = matrix(6, 2, 2),
  df = data.frame(A = 1:3, I = 4:6),
  list = list(D = 1, M = 9)
)

str(Doraemon.bag)

## List of 5
##  $ char: chr "K"
##  $ num : num [1:3] 2 2 6
##  $ mat : num [1:2, 1:2] 6 6 6 6
##  $ df  :'data.frame':    3 obs. of  2 variables:
##   ..$ A: int [1:3] 1 2 3
##   ..$ I: int [1:3] 4 5 6
##  $ list:List of 2
##   ..$ D: num 1
##   ..$ M: num 9

4.5.2 Lists - Indexing

L <- list(
  name = c("john", "dave", "linda", "kino"), 
  height = c(183, 168, 177, 167), 
  gender = c("M", "F", "M"))
L$name

## [1] "john"  "dave"  "linda" "kino"

L[["name"]]

## [1] "john"  "dave"  "linda" "kino"

L["name"]

## $name
## [1] "john"  "dave"  "linda" "kino"

c(is.vector(L$name), is.vector(L[["name"]]),   
  is.list(L["name"]))

## [1] TRUE TRUE TRUE

4.6 Factor

4.6.1 Factors - Creation and Attributes

f <- factor(c("yes", "yes", "no", "no", "yes"))

f

## [1] yes yes no  no  yes
## Levels: no yes

attributes(f)

## $levels
## [1] "no"  "yes"
## 
## $class
## [1] "factor"

f <- factor(
  c("yes", "yes", "no", "no", "yes"),    
  levels = c("yes", "no"))

f

## [1] yes yes no  no  yes
## Levels: yes no

attr(f, "levels") <- c("y", "n")

unclass(f)

## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"

4.6.2 Factors - cut

age <- sample(21:99, size = 50, replace = T)
age

##  [1] 66 91 75 50 90 66 52 85 90 72 35 66 50 68 91 55 74 40 21 64 37 45 53
## [24] 96 26 66 24 57 27 64 40 29 99 60 85 97 32 33 76 53 65 30 63 80 27 54
## [47] 82 30 58 44

age_label <- cut(age, breaks = c(20,45,75,100), labels = c("young","middle","old"))
age_label

##  [1] middle old    middle middle old    middle middle old    old    middle
## [11] young  middle middle middle old    middle middle young  young  middle
## [21] young  young  middle old    young  middle young  middle young  middle
## [31] young  young  old    middle old    old    young  young  old    middle
## [41] middle young  middle old    young  middle old    young  middle young 
## Levels: young middle old

4.7 Date

# Get today's date
today <- Sys.Date()
# ISO 8601 standard format: yyyy-mm-dd
print(today)

## [1] "2018-07-10"

typeof(today)

## [1] "double"

class(today)

## [1] "Date"

# Convert a string to a date
secondDay <- as.Date("2018-07-10")
print(secondDay)

## [1] "2018-07-10"

# What if the dates in our data are formatted like 07|10|2018?
thirdDay <- c("07|11|2018")
class(thirdDay)

## [1] "character"

print(thirdDay)

## [1] "07|11|2018"

# %Y matches a four-digit year (%y would read only the first two digits, "20")
thirdDay <- as.Date(thirdDay, format = "%m|%d|%Y")
class(thirdDay)

## [1] "Date"

print(thirdDay)

## [1] "2018-07-11"

`format` code	Description
`%b`	Abbreviated month name (e.g. Jan)
`%B`	Full month name
`%d`	Two-digit day of the month
`%m`	Two-digit month
`%y`	Year without century (two digits)
`%Y`	Year with century (four digits)
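The same codes work in both directions: `format()` turns a Date into text and `as.Date()` parses text into a Date. A short sketch using only the numeric codes, which do not depend on the locale (the variable names are illustrative):

```r
d <- as.Date("2018-07-10")

iso   <- format(d, "%Y-%m-%d")   # "2018-07-10"
short <- format(d, "%d/%m/%y")   # "10/07/18"

# Parsing with the matching format recovers the same date
parsed <- as.Date(short, format = "%d/%m/%y")
```

The month-name codes `%b` and `%B` depend on the system locale, so they are left out of this sketch.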

today <- ISOdate(year = 2018,month = 07,day = 10)
print(today)

## [1] "2018-07-10 12:00:00 GMT"

class(today)

## [1] "POSIXct" "POSIXt"

today <- as.Date(today)
class(today)

## [1] "Date"

y <- 2015:2018
m <- 5:8
d <- 5:8
ISOdate(y,m,d)

## [1] "2015-05-05 12:00:00 GMT" "2016-06-06 12:00:00 GMT"
## [3] "2017-07-07 12:00:00 GMT" "2018-08-08 12:00:00 GMT"

# ISOdatetime extends this to hours, minutes, and seconds

# Time is relative, so R measures dates from the reference date 1970-01-01
as.integer(today)

## [1] 17722

julian(today)

## [1] 17722
## attr(,"origin")
## [1] "1970-01-01"

as.integer(as.Date("1970-01-02"))

## [1] 1

# To extract part of a date, convert the Date object to a POSIXlt object, which is a list of the date's components.

today_list <- as.POSIXlt(today)
print(today_list)

## [1] "2018-07-10 UTC"

class(today_list)

## [1] "POSIXlt" "POSIXt"

today_list$mday

## [1] 10

today_list$mon # January = 0

## [1] 6

today_list$year

## [1] 118

today_list$year + 1900

## [1] 2018

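`POSIXlt` component	Description
`sec`	Seconds
`min`	Minutes
`hour`	Hours
`mday`	Day of the month
`mon`	Month (0-11, January = 0)
`year`	Years since 1900
`wday`	Day of the week (0-6, Sunday = 0)
`yday`	Day of the year (0-365)
`isdst`	Daylight saving time flag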
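Because several components are zero-based or offset, it is easy to be off by one. The following sketch converts them to the values people usually expect for the date used in this section; the variable names are illustrative.

```r
lt <- as.POSIXlt(as.Date("2018-07-10"))

year        <- lt$year + 1900   # years are counted from 1900: 2018
month       <- lt$mon + 1       # months are 0-11: 7
weekday     <- lt$wday          # 0 = Sunday, so 2 means Tuesday
day_of_year <- lt$yday + 1      # yday starts at 0: 191
```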

# Create a sequence of dates
f <- as.Date("2018-07-01")
t <- as.Date("2018-07-10")
seq(from = f, to = t, by = 1)

##  [1] "2018-07-01" "2018-07-02" "2018-07-03" "2018-07-04" "2018-07-05"
##  [6] "2018-07-06" "2018-07-07" "2018-07-08" "2018-07-09" "2018-07-10"

# Date arithmetic
# Differences
# Number of days between two dates
begin <- as.Date("2016-03-04")
end <- as.Date("2016-05-08")
(during <- end - begin)

## Time difference of 65 days

# Number of weeks and hours between two dates
difftime(end,begin,units = "weeks")

## Time difference of 9.285714 weeks

difftime(end,begin,units = "hours")

## Time difference of 1560 hours

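Everything above uses functions from base R. Next comes lubridate, a package dedicated to handling date-time data efficiently. It is a very popular preprocessing package, widely praised for speeding up data preparation. lubridate has two main families of functions, one for instants (points in time) and one for time spans. The functions are powerful and intuitive: converting a string to a date needs no matching format string, and extracting a component of a date takes a single small function with no extra arguments.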

## [1] "2009-01-01" "2009-01-02" "2009-01-03" "2009-01-04" "2009-01-05"
## [6] "2009-01-06" "2009-01-07"

## [1] 20

## [1] 6

## [1] 1

## [1] 30
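As a sketch of the format-free parsing and one-function extraction described above, assuming the lubridate package is installed (the code is skipped otherwise). The dates and variable names are chosen for illustration and are not the ones behind the output shown earlier.

```r
if (requireNamespace("lubridate", quietly = TRUE)) {
  library(lubridate)

  # The function name states the order of the parts; no format string needed
  d1 <- ymd("2018-07-10")
  d2 <- mdy("07/11/2018")

  # One small function per component
  parts <- c(year = year(d1), month = month(d1), day = day(d1))

  # Dates parsed this way are ordinary Date objects
  gap_days <- as.numeric(d2 - d1)
}
```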

4.8 Missing Value

q <- c(1, NaN, NA, 4)

class(q)

## [1] "numeric"

is.nan(q)

## [1] FALSE  TRUE FALSE FALSE

is.na(q)

## [1] FALSE  TRUE  TRUE FALSE

mean(q)

## [1] NaN

mean(q, na.rm = T)

## [1] 2.5

is.na(NA)

## [1] TRUE

is.na(NaN)

## [1] TRUE

is.nan(NaN)

## [1] TRUE

is.nan(NA)

## [1] FALSE

5 Reading and Writing Data in R

Reading csv

data <- read.csv("./data/data.csv", header = TRUE) 

data <- read.table("./data/data.txt", header = TRUE)

Writing csv

write.table(data, file = "./data/dataout.txt")
write.csv(data, file = "./data/dataout.csv")
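The paths above assume a ./data folder that may not exist on your machine. The sketch below shows the same write-then-read round trip with a temporary file instead; the data frame `people` and the other names are made up for illustration.

```r
# A small hypothetical data set
people <- data.frame(
  name   = c("john", "linda"),
  height = c(183, 177))

# Write to a temporary csv file; row.names = FALSE avoids an extra index column
path <- tempfile(fileext = ".csv")
write.csv(people, file = path, row.names = FALSE)

# Read it back and clean up
restored <- read.csv(path, header = TRUE)
unlink(path)
```

Without `row.names = FALSE`, `write.csv()` stores the row names as a first column, which then shows up as an extra column when the file is read back.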

6 Control Flow in R

if, else
for and something new
while
repeat
break
next

6.1 if, else

s <- 1:5
ifelse(s > 3, yes = "Good", no = "Fair")

## [1] "Fair" "Fair" "Fair" "Good" "Good"

a <- 1

if(a > 2){
  b <- 1
}else{
  b <- 0
}
b

## [1] 0

a <- -1

if(a > 2){
  b <- 1
}else if(a < 0){
  b <- 1
}else{
  b <- 0
}
b

## [1] 1

6.2 for

for(i in 1:3){
  print(i) 
}

## [1] 1
## [1] 2
## [1] 3

(M <- matrix(1:2, 2, 3))

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2

Column sums

numeric(3)

## [1] 0 0 0

col.sum <- numeric(ncol(M))
for(j in 1:ncol(M)){
  for(i in 1:nrow(M)){
    col.sum[j] <- col.sum[j] + M[i, j]
  }
}

col.sum

## [1] 3 3 3

Column sums

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2

Vectorized computation

col.sum

## [1] 3 3 3
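The vectorized call itself is not shown in these notes. A plausible sketch uses `colSums()`, which appeared in the matrix section; treating it as the intended function is an assumption. The loop version is repeated under the hypothetical name `col.sum.loop` for comparison.

```r
M <- matrix(1:2, 2, 3)

# Loop version: accumulate each column element by element
col.sum.loop <- numeric(ncol(M))
for (j in 1:ncol(M)) {
  for (i in 1:nrow(M)) {
    col.sum.loop[j] <- col.sum.loop[j] + M[i, j]
  }
}

# Vectorized version: one call, no explicit loop
col.sum <- colSums(M)   # 3 3 3
```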

6.3 while

i <- 0
while(i < 3){
  print(i)
  i <- i + 1
}

## [1] 0
## [1] 1
## [1] 2

6.4 repeat and break

i <- 0
repeat{
  print(i)
  # Leave the loop once i exceeds 1; otherwise keep counting
  if(i > 1){
    break
  }
  i <- i + 1
}

## [1] 0
## [1] 1
## [1] 2

6.5 next

for(i in 1:10){
  if(i <= 7){ 
    next  # skip the rest of this iteration
  }
  print(i)
}

## [1] 8
## [1] 9
## [1] 10

6.6 The apply Family

print(M)

##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2

apply(M, 2, sum)

## [1] 3 3 3

apply(M, 1, sum)

## [1] 3 6

6.6.1 lapply and sapply

L <- list(
  a = matrix(1:3, 1), 
  b = data.frame(
    char = c("A", "B"), 
    logic = c(T, F)), 
  c = list(
    1:3, 
    "T",
    matrix(1:6, 2, 3)))
L

## $a
##      [,1] [,2] [,3]
## [1,]    1    2    3
## 
## $b
##   char logic
## 1    A  TRUE
## 2    B FALSE
## 
## $c
## $c[[1]]
## [1] 1 2 3
## 
## $c[[2]]
## [1] "T"
## 
## $c[[3]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

lapply(L, class)

## $a
## [1] "matrix"
## 
## $b
## [1] "data.frame"
## 
## $c
## [1] "list"

sapply(L, class)

##            a            b            c 
##     "matrix" "data.frame"       "list"

6.6.2 mapply

mapply(rep, 1:6, 6:1)

## [[1]]
## [1] 1 1 1 1 1 1
## 
## [[2]]
## [1] 2 2 2 2 2
## 
## [[3]]
## [1] 3 3 3 3
## 
## [[4]]
## [1] 4 4 4
## 
## [[5]]
## [1] 5 5
## 
## [[6]]
## [1] 6

6.6.3 tapply

set.seed(6)
n <- 9
bike <- data.frame(
  orgn = sample(c("SE", "S", "N"), n, replace = T),
  dest = sample(c("SE", "S", "N"), n, replace = T),
  flux = round(rnorm(n, 18)))

bike

##   orgn dest flux
## 1    S   SE   17
## 2    N    S   20
## 3   SE    N   17
## 4    S   SE   19
## 5    N   SE   18
## 6    N    N   17
## 7    N   SE   18
## 8    N    S   20
## 9    S    N   17

tapply(X = bike$flux, INDEX = bike$orgn, FUN = sum)

##  N  S SE 
## 93 53 17

##     
##      N S SE
##   N  1 2  2
##   S  1 0  2
##   SE 1 0  0

##     dest
## orgn  N  S SE
##   N  17 20 18
##   S  17 NA 18
##   SE 17 NA NA
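The calls behind the last two tables are not shown. The sketch below is one plausible reconstruction: a count of trips per origin and destination with `table()`, and the mean flux per origin and destination with `tapply()` over two grouping factors. The data frame is typed in from the printed output rather than sampled, since `sample()` can give different draws across R versions; `trip_counts` and `mean_flux` are names introduced here.

```r
# The bike data as printed above
bike <- data.frame(
  orgn = c("S", "N", "SE", "S", "N", "N", "N", "N", "S"),
  dest = c("SE", "S", "N", "SE", "SE", "N", "SE", "S", "N"),
  flux = c(17, 20, 17, 19, 18, 17, 18, 20, 17))

# Number of trips for each origin (rows) and destination (columns)
trip_counts <- table(bike$orgn, bike$dest)

# Mean flux for each origin/destination pair; pairs with no trips give NA
mean_flux <- tapply(bike$flux, bike[c("orgn", "dest")], mean)
```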

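6.6.4 Summary: The apply Family

Command	Description
`apply(X, MARGIN, FUN)`	Applies `FUN` over the given `MARGIN` of an array or matrix `X`, returning a vector, array, or list
`lapply(X, FUN)`	Applies `FUN` to each element of a list `X`, returning a list
`sapply(X, FUN)`	Simplified version of `lapply` that returns a vector or array instead of a list where possible
`mapply(FUN, ...)`	Multivariate version of `sapply`, applying `FUN` to several arguments in parallel
`tapply(X, INDEX, FUN)`	Applies `FUN` to the groups of `X` defined by the factor combinations in `INDEX`, returning a table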

7 Functions in R

Definition
Parameters
Encapsulation

7.1 Definition

rsummary <- function(x) {
  return(
    list(
    mean = mean(x), 
    sd = sd(x),
    quantiles = quantile(x)))
}

x <- rnorm(100, mean = 6, sd = 1)

rsummary(x)

## $mean
## [1] 5.928524
## 
## $sd
## [1] 0.990941
## 
## $quantiles
##       0%      25%      50%      75%     100% 
## 4.047651 5.268584 5.818003 6.522913 8.608098

calGrade <- function(x, base = 59) {
  s <- x + base
  if(s >= 90) {
    grade <- "A"
  } else if(s >= 80) {
    grade <- "B"
  } else if(s >= 70) {
    grade <- "C"
  } else if(s >= 60){
    grade <- "D"
  }else{
    grade <- "Failed"
  }
  grade
}

scores <- c(40, 20, 27, 31, 0, 0)

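# calGrade() handles one score at a time, because `if` needs a length-one condition.
# R before 4.2.0 used only the first element (with a warning); R 4.2.0 and later stop with an error.
calGrade(scores[1])

## [1] "A"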

lapply(scores, calGrade)

## [[1]]
## [1] "A"
## 
## [[2]]
## [1] "C"
## 
## [[3]]
## [1] "B"
## 
## [[4]]
## [1] "A"
## 
## [[5]]
## [1] "Failed"
## 
## [[6]]
## [1] "Failed"
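Instead of calling the function once per score with lapply, the grading rule can be written so that it accepts a whole vector. The sketch below is one way to do this with `cut()`; the function name `calGradeVec` is hypothetical, and the thresholds are applied to the adjusted score `x + base`, as in the first three branches of calGrade.

```r
calGradeVec <- function(x, base = 59) {
  s <- x + base
  # right = FALSE makes the intervals [60, 70), [70, 80), ... so that
  # a score of exactly 90 is an "A", matching the >= comparisons
  grade <- cut(s,
    breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("Failed", "D", "C", "B", "A"),
    right = FALSE)
  as.character(grade)
}

scores <- c(40, 20, 27, 31, 0, 0)
grades <- calGradeVec(scores)
# "A" "C" "B" "A" "Failed" "Failed"
```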

return

square <- function(x) {
  return(x^2)
}

square(11)

## [1] 121

square <- function(x) {
  x^2
}

square(11)

## [1] 121

7.2 Parameter

formals

formals(calGrade)

## $x
## 
## 
## $base
## [1] 59

formals(plot)

## $x
## 
## 
## $y
## 
## 
## $...

Arguments may be missing, and may have default values
Argument matching

(M <- matrix(1:6, nrow = 2)) # exact matching

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

(M <- matrix(1:6, nr = 2)) # partial matching

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

(M <- matrix(1:6, 2)) # positional matching

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

quantile(x, probs = seq(0, 1, 0.25), ...)
apply(X, MARGIN, FUN, ...)

set.seed(1)
(x <- matrix(rnorm(12), 3))

##            [,1]       [,2]      [,3]       [,4]
## [1,] -0.6264538  1.5952808 0.4874291 -0.3053884
## [2,]  0.1836433  0.3295078 0.7383247  1.5117812
## [3,] -0.8356286 -0.8204684 0.5757814  0.3898432

apply(x, 2, quantile)  # matrix

##            [,1]       [,2]      [,3]        [,4]
## 0%   -0.8356286 -0.8204684 0.4874291 -0.30538839
## 25%  -0.7310412 -0.2454803 0.5316052  0.04222742
## 50%  -0.6264538  0.3295078 0.5757814  0.38984324
## 75%  -0.2214052  0.9623943 0.6570530  0.95081220
## 100%  0.1836433  1.5952808 0.7383247  1.51178117

apply(x, 2, quantile, probs = c(.25, .75))

##           [,1]       [,2]      [,3]       [,4]
## 25% -0.7310412 -0.2454803 0.5316052 0.04222742
## 75% -0.2214052  0.9623943 0.6570530 0.95081220

str(paste)

## function (..., sep = " ", collapse = NULL)

GOOD

paste("Hi", " There", sep = "!")

## [1] "Hi! There"

BAD (arguments that follow ... must be spelled out in full, so se = "!" is pasted as a third string)

paste("Hi", " There", se = "!")

## [1] "Hi  There !"

x <- 1:30
plot(x, log(x))

A user-defined function

my.plot <- 
  function(x, y, type = "l", ...){
     plot(x, y, type = type, ...)
}

my.plot(x, log(x))

7.3 Encapsulation

Obsessive Compulsive Disorder, OCD

The main file calls a function file

source("./fn/MyPlotEncap.R")  
my.plot.encap(x, log(x))
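The file ./fn/MyPlotEncap.R is not reproduced in these notes. To show the mechanism end to end, the sketch below writes a tiny function file to a temporary location and then loads it with `source()`. The file contents and the name `fn_file` are hypothetical stand-ins, with the `square` function from section 7.1 playing the role of the encapsulated function.

```r
# Create a stand-in "function file" containing one function definition
fn_file <- tempfile(fileext = ".R")
writeLines(c(
  "square <- function(x) {",
  "  x^2",
  "}"), fn_file)

# The main script loads the definitions, then calls the function
source(fn_file)
result <- square(11)   # 121

unlink(fn_file)
```

Keeping function definitions in their own files lets the main script stay short and read as a sequence of steps.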

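8 Homework

Part 1: swirl is a package designed for practicing R interactively; you can learn R step by step by following its prompts. Some of the homework for this course is completed with this package. A detailed introduction to the package and its full code repository are available online. Follow the steps below to complete this assignment.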

knitr::include_graphics('fig/swirl.png')

First, install the swirl package:

install.packages("swirl")

Next, start swirl:

# Load swirl
library("swirl")

# Start swirl
swirl()

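The following screen appears:

After you enter swirl(), it continues with:

Enter your name and you will reach the course selection menu. For the second day's homework, please complete lessons 1-5.

Part 2:

Watch the introductory R video in the 狗熊会 (Clubear) video mini-course and learn how to import data into R.

Download the Speed Dating data, its data description, and the accompanying article.

Create an R Markdown document.

Read in the Speed Dating data.

Convert gender to a factor.

Compute the ratio of men to women in the experiment.

Compute the mean rating that men gave, before dating, to each attribute of their ideal partner (attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, shar1_1).

Compute the mean rating that women gave, before dating, to each attribute of their ideal partner (attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, shar1_1).

What conclusions can you draw?

What other questions could be studied?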

R and Its Applications in Social Analysis

常象宇

Update 07/10/2018
