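Data is the core subject of data science. In the era of big data, anything a computer can record counts as data. Because data has numeric properties, it supports basic arithmetic. Arithmetic is arguably the most basic building block in R's design, so you can start by thinking of R as a calculator.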
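Command | Description |
---|---|
`+`, `-`, `*`, `/` | Addition, subtraction, multiplication, division |
`^` | Exponentiation |
`%%` | Remainder (modulo) |
`round()` | Rounding |
`floor()` | Round down |
`ceiling()` | Round up |
`abs()` | Absolute value |
`sqrt()` | Square root |
`log()` | Logarithm (natural by default) |
`exp()` | Exponential function |
`( )` | Change the order of evaluation |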
Command | Description |
---|---|
`&`, `\|`, `!` | AND, OR, NOT |
`==` | Test for equality |
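Command | Description |
---|---|
`=` | One-way assignment (right to left only) |
`<-` | Two-way assignment (can also be written as `->`) |
`<<-` | Assign to a variable in an enclosing (parent) environment |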
22 %% 6
## [1] 4
floor(22/6)
## [1] 3
sqrt(22/6)
## [1] 1.914854
log(22/6)
## [1] 1.299283
3>4
## [1] FALSE
F == T
## [1] FALSE
Exercise: try to compute 24 using four 1s.
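One possible answer to the exercise, as a sketch only. It assumes functions beyond the operators listed in the table are allowed; `factorial()` is part of base R but is not mentioned in the original text.

```r
# One possible solution: add the four 1s, then take the factorial (4! = 24).
answer <- factorial(1 + 1 + 1 + 1)
answer
```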
<-
Two-way assignment
asg <- 2
asg
## [1] 2
3 -> asg
asg
## [1] 3
=
One-way assignment
asg = 1
asg
## [1] 1
# This is invalid: a value cannot be the target of an assignment
1 = asg
=
Passing function arguments
Usage
matrix(data = NA, nrow = 1, ncol = 1, ...)
matrix(1:6, 3)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
How do we generate a 2 × 3 matrix?
GOOD
matrix(1:6, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
BAD
matrix(1:6, ncol <- 3)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
BAD: passing function arguments
ls() # What does this show?
mean(x = 1:3)
## [1] 2
ls()
## [1] "asg" "ncol"
mean(x <- 1:3)
## [1] 2
ls()
## [1] "asg" "ncol" "x"
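> Except when passing function arguments, `=` can be replaced by `<-`, but not the other way around. For example, in `system.time(A <- matrix(1:6, 3))` only `<-` works, because `A = ...` would be read as an argument name.
`<<-` assigns to a variable in an enclosing (parent) environment.
Recommendation: use `<-` rather than `=` for assignment.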
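To illustrate what assigning into an enclosing environment means, here is a minimal sketch. The helper `make_counter` and the variable `count` are hypothetical names introduced only for this example; they do not appear in the original text.

```r
# Hypothetical helper: a counter whose state lives in the enclosing function.
make_counter <- function() {
  count <- 0
  function() {
    # `<<-` updates `count` in the enclosing environment of this inner function,
    # whereas `<-` would only create a new local variable that is then discarded.
    count <<- count + 1
    count
  }
}

counter <- make_counter()
first <- counter()
second <- counter()
c(first, second)
```

Each call to `counter()` sees the value left by the previous call, which would not happen with a plain `<-` inside the inner function.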
Further reading: the 狗熊会 (Clubear) talent program post on code style.
Atomic objects
Logical: logical
Character: character
Integer: integer
Real: numeric (double)
Complex: complex
is.logical(T)
## [1] TRUE
is.logical(1)
## [1] FALSE
typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.integer(7)
## [1] FALSE
is.character("K")
## [1] TRUE
is.character(as.character("1"))
## [1] TRUE
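A bare `7` is stored as a double, which is why `is.integer(7)` is `FALSE`. As a supplementary sketch, an integer can be requested explicitly, and `c()` coerces mixed inputs to a common type. The `L` suffix is standard R syntax but is not mentioned in the original text, and the variable names are made up for illustration.

```r
# The L suffix creates an integer literal.
x_int <- 7L
typeof(x_int)          # "integer"
is.integer(x_int)      # TRUE

# Explicit conversion also works.
typeof(as.integer(7))  # "integer"

# Mixing types in c() coerces along logical -> integer -> double -> character.
mixed <- c(TRUE, 2L, 3.5)
typeof(mixed)          # "double"
typeof(c(mixed, "a"))  # "character"
```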
Non-atomic objects
Vectors
Matrices
Data frames
Lists
Factors
Dates and times: Date
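What is a vector? A vector is a one-dimensional array that stores data of a single type, and it is the most basic of all data structures. Think of it as a house divided into many small rooms, each holding an element of the same type. Below we cover vectors in two parts: basic operations and common types.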
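In general, the function `c()` creates a vector: just list the elements inside the parentheses. That gets tedious, so is there a shortcut? The shortcut lies in patterns. If you know the pattern your vector follows, there may be a function made for it. For an arithmetic sequence, use `seq`; for consecutive integers from a to b, use `a:b`; to draw a few numbers at random from a set, use `sample`; and with the string-pasting function `paste` you can combine characters and numbers in a regular way (for example, to name variables in bulk). Some examples: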
vector(mode = "character", length = 3)
## [1] "" "" ""
v <- c(0, 1, -2)
typeof(v)
## [1] "double"
class(v)
## [1] "numeric"
length(v)
## [1] 3
a <- 1:10
a
## [1] 1 2 3 4 5 6 7 8 9 10
b <- seq(1,10, by = 2)
b
## [1] 1 3 5 7 9
c <- c('a', 'b', 'c')
c
## [1] "a" "b" "c"
d <- sample(a, size = 20, replace = T)
d
## [1] 8 8 3 2 1 8 8 1 7 4 7 2 4 3 5 5 6 10 8 6
e <- paste("x_", 1:5, collapse = ",")
e
## [1] "x_ 1,x_ 2,x_ 3,x_ 4,x_ 5"
as.character(v)
## [1] "0" "1" "-2"
as.logical(v)
## [1] FALSE TRUE TRUE
class(v)
## [1] "numeric"
class(as.logical(v))
## [1] "logical"
v[2]
## [1] 1
v[c(2, 3)]
## [1] 1 -2
v[2:3]
## [1] 1 -2
v[-2]
## [1] 0 -2
which(v==1)
## [1] 2
v[which(v < 1)]
## [1] 0 -2
names(v)
## NULL
names(v) <- c("Davide", "Linda", "Smale")
names(v)
## [1] "Davide" "Linda" "Smale"
v[c("Davide", "Linda")]
## Davide Linda
## 0 1
t <- c(2, 1, 3)
names(v) <- c("Davide", "Linda", "Smale")
(v + t)/2
## Davide Linda Smale
## 1.0 1.0 0.5
t > v
## Davide Linda Smale
## TRUE FALSE TRUE
summary(v)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.0000 -1.0000 0.0000 -0.3333 0.5000 1.0000
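Command | Description |
---|---|
`sum(v)` | Sum of all elements of `v` |
`mean(v)` | Mean of `v` |
`median(v)` | Median of `v` |
`min(v)`, `max(v)` | Minimum and maximum of `v` |
`sd(v)`, `var(v)` | Standard deviation and variance of `v` |
`length(v)` | Number of elements in `v` |
`pmax(v1, v2)`, `pmin(v1, v2)` | Element-wise maximum and minimum; for example, `pmax(quiz1, quiz2)` returns each student's higher score across two quizzes |
`sort(v)` | Sort `v` |
`order(v)` | Indices that would sort `v` |
`unique(v)` | Remove duplicates from `v` |
`summary(v)` | Five-number summary (plus the mean) |
`any(v)` | Whether at least one element of `v` is `TRUE` |
`all(v)` | Whether all elements of `v` are `TRUE` |
`intersect` | Set intersection |
`union` | Set union |
`setdiff` | Set difference |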
f <- c(1,2,3,2,1,1,1,2,3,4,5,1,3,1,2,3,1)
x <- letters[f]
x
## [1] "a" "b" "c" "b" "a" "a" "a" "b" "c" "d" "e" "a" "c" "a" "b" "c" "a"
match(x, letters[1:5])
## [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1
match(x, letters[1:7])
## [1] 1 2 3 2 1 1 1 2 3 4 5 1 3 1 2 3 1
match(x, letters[1:3])
## [1] 1 2 3 2 1 1 1 2 3 NA NA 1 3 1 2 3 1
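Character data differs from numeric data: it has quirks of its own and needs dedicated functions to handle them. Even a single string is special. We rarely ask for the length of a number, but we often need the length of a string. Note that `length()` counts elements while `nchar()` counts characters: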
length("欢迎关注狗熊会")
## [1] 1
nchar("欢迎关注狗熊会")
## [1] 7
nchar("Welcome to follow the Clubear")
## [1] 29
Once we know a string's length, we can slice it to extract a substring with `substr`, called as `substr("char", begin_position, end_position)`. For example:
substr("Welcome to follow the Clubear", 3,9)
## [1] "lcome t"
paste(1:4)
## [1] "1" "2" "3" "4"
paste(1:4, collapse = "")
## [1] "1234"
paste(1:4, sep = "")
## [1] "1" "2" "3" "4"
paste(1:4, collapse = ",")
## [1] "1,2,3,4"
paste(1:4, sep = ",")
## [1] "1" "2" "3" "4"
paste(letters[1:4], 1:4)
## [1] "a 1" "b 2" "c 3" "d 4"
paste(letters[1:4], 1:4, collapse = "_")
## [1] "a 1_b 2_c 3_d 4"
paste(letters[1:4], 1:4, sep = "_")
## [1] "a_1" "b_2" "c_3" "d_4"
paste(letters[1:4], 1:4, collapse = "_",sep = "|")
## [1] "a|1_b|2_c|3_d|4"
t <- c("狗熊会", "Clubear", "Social Science")
grep("Social", t)
## [1] 3
# Compute the mean salary
salary <- c("10w", "15w", "23w", "28w")
mean(salary)
## Warning in mean.default(salary): argument is not numeric or logical:
## returning NA
## [1] NA
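`mean()` returns `NA` here because `salary` is a character vector. A minimal sketch of one fix, assuming every value is a number followed by the unit suffix `w`; the name `salary_num` is introduced only for this example.

```r
salary <- c("10w", "15w", "23w", "28w")

# Remove the "w" suffix, then convert the remaining digits to numbers.
salary_num <- as.numeric(gsub("w", "", salary))
salary_num

# The mean is now defined (in units of "w").
mean(salary_num)
```

If some values used a different suffix or contained stray characters, `as.numeric()` would return `NA` for them with a warning, so real data may need a more careful pattern.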
Key function: `gsub`
M <- matrix(1:9, nrow = 3, ncol = 3)
print(M)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
length(M)
## [1] 9
dim(M)
## [1] 3 3
cbind(1:3, 4:6)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
rbind(1:3, 4:6)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
attributes(M)
## $dim
## [1] 3 3
I <- diag(1:4)
I
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 2 0 0
## [3,] 0 0 3 0
## [4,] 0 0 0 4
vec <- 1:9
vec
## [1] 1 2 3 4 5 6 7 8 9
attributes(vec)
## NULL
dim(vec) <- c(3, 3)
vec
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
M[2,3]
## [1] 8
M[1, ]
## [1] 1 4 7
M[, 2]
## [1] 4 5 6
Q <- matrix(rep(10, 9), 3, 3)
M * Q
## [,1] [,2] [,3]
## [1,] 10 40 70
## [2,] 20 50 80
## [3,] 30 60 90
M %*% Q
## [,1] [,2] [,3]
## [1,] 120 120 120
## [2,] 150 150 150
## [3,] 180 180 180
colMeans(M)
## [1] 2 5 8
colSums(M)
## [1] 6 15 24
colnames(M)
## NULL
rownames(M)
## NULL
# install.packages("Matrix")
library(Matrix)
vectors <- c(1:4, rep(0,4), 5,0,0,9)
m <- matrix(vectors, nrow = 4)
print(m)
## [,1] [,2] [,3]
## [1,] 1 0 5
## [2,] 2 0 0
## [3,] 3 0 0
## [4,] 4 0 9
sm <- Matrix(vectors,nrow = 4, sparse = TRUE)
print(sm)
## 4 x 3 sparse Matrix of class "dgCMatrix"
##
## [1,] 1 . 5
## [2,] 2 . .
## [3,] 3 . .
## [4,] 4 . 9
summary(sm)
## 4 x 3 sparse Matrix of class "dgCMatrix", with 6 entries
## i j x
## 1 1 1 1
## 2 2 1 2
## 3 3 1 3
## 4 4 1 4
## 5 1 3 5
## 6 4 3 9
sm <- spMatrix(nrow = 10,ncol = 20,i = 1:7,j = 1:7,x = 7*(1:7))
print(sm)
## 10 x 20 sparse Matrix of class "dgTMatrix"
##
## [1,] 7 . . . . . . . . . . . . . . . . . . .
## [2,] . 14 . . . . . . . . . . . . . . . . . .
## [3,] . . 21 . . . . . . . . . . . . . . . . .
## [4,] . . . 28 . . . . . . . . . . . . . . . .
## [5,] . . . . 35 . . . . . . . . . . . . . . .
## [6,] . . . . . 42 . . . . . . . . . . . . . .
## [7,] . . . . . . 49 . . . . . . . . . . . . .
## [8,] . . . . . . . . . . . . . . . . . . . .
## [9,] . . . . . . . . . . . . . . . . . . . .
## [10,] . . . . . . . . . . . . . . . . . . . .
summary(sm)
## 10 x 20 sparse Matrix of class "dgTMatrix", with 7 entries
## i j x
## 1 1 1 7
## 2 2 2 14
## 3 3 3 21
## 4 4 4 28
## 5 5 5 35
## 6 6 6 42
## 7 7 7 49
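A data frame is the data structure we meet most often; data read from a csv or txt file is stored as a data frame automatically. It is also tabular, but unlike a matrix, which can hold only one data type (numeric and character values cannot coexist in a matrix), a data frame can hold different types in different columns. Real data tables usually mix types, such as character, date, and numeric columns, and a data frame is how R reads and represents them. Note, however, that each column of a data frame must still hold a single data type. If it does not, R coerces the values within limits: a column containing both text and numbers is converted entirely to text.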
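The column-wise coercion described above can be seen in a small sketch. The data frame `mixed_col` and its contents are hypothetical. `stringsAsFactors = FALSE` is set explicitly because the default differs between R versions (older versions convert strings to factors).

```r
# Hypothetical data: the `value` column mixes numbers and text.
mixed_col <- data.frame(
  id = 1:3,
  value = c(1, "two", 3),
  stringsAsFactors = FALSE)

# Each column has exactly one type; `value` was coerced to character.
str(mixed_col)
class(mixed_col$id)
class(mixed_col$value)
```

Converting such a column back with `as.numeric()` yields `NA` for entries like `"two"`, so it is worth checking column types right after reading data.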
nutrition <- data.frame(fruit = c("apples", "pears", "bananas"), calories = c(52, 57, 92))
head(nutrition, 3)
## fruit calories
## 1 apples 52
## 2 pears 57
## 3 bananas 92
ncol(nutrition)
## [1] 2
nrow(nutrition)
## [1] 3
cbind(nutrition, nutrition)
## fruit calories fruit calories
## 1 apples 52 apples 52
## 2 pears 57 pears 57
## 3 bananas 92 bananas 92
rbind(nutrition, nutrition)
## fruit calories
## 1 apples 52
## 2 pears 57
## 3 bananas 92
## 4 apples 52
## 5 pears 57
## 6 bananas 92
names(nutrition)[1] <- "type"
attributes(nutrition)
## $names
## [1] "type" "calories"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
colnames(nutrition)
## [1] "type" "calories"
str(nutrition)
## 'data.frame': 3 obs. of 2 variables:
## $ type : Factor w/ 3 levels "apples","bananas",..: 1 3 2
## $ calories: num 52 57 92
summary(nutrition)
## type calories
## apples :1 Min. :52.0
## bananas:1 1st Qu.:54.5
## pears :1 Median :57.0
## Mean :67.0
## 3rd Qu.:74.5
## Max. :92.0
nutrition[which(nutrition$type == "apples"), ]
## type calories
## 1 apples 52
nutrition$calories # returns a vector
## [1] 52 57 92
nutrition[["calories"]] # returns a vector
## [1] 52 57 92
nutrition["calories"] # returns a data frame
## calories
## 1 52
## 2 57
## 3 92
calorie <- nutrition[["calories"]]
max(calorie)
## [1] 92
Exercise: select the fruits whose calories exceed 80.
## type
## 3 bananas
Hint: use the `subset` function.
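One way to do this with `subset`, shown as a sketch rather than the intended answer. The data frame is rebuilt here so the block runs on its own, with the first column already named `type` as in the text; the name `high` is introduced only for this example.

```r
nutrition <- data.frame(
  type = c("apples", "pears", "bananas"),
  calories = c(52, 57, 92))

# Keep rows with calories above 80 and only the `type` column.
high <- subset(nutrition, calories > 80, select = type)
high

# Equivalent base indexing; drop = FALSE keeps the result a data frame.
nutrition[nutrition$calories > 80, "type", drop = FALSE]
```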
row.names(nutrition)
## [1] "1" "2" "3"
rownames(nutrition)
## [1] "1" "2" "3"
list(
c("john", "dave", "linda", "kino"),
c(183, 168, 177, 167),
c("M", "F", "M"))
## [[1]]
## [1] "john" "dave" "linda" "kino"
##
## [[2]]
## [1] 183 168 177 167
##
## [[3]]
## [1] "M" "F" "M"
L <- list(
name = c("john", "dave", "linda", "kino"),
height = c(183, 168, 177, 167),
gender = c("M", "F", "M"))
L
## $name
## [1] "john" "dave" "linda" "kino"
##
## $height
## [1] 183 168 177 167
##
## $gender
## [1] "M" "F" "M"
str(L)
## List of 3
## $ name : chr [1:4] "john" "dave" "linda" "kino"
## $ height: num [1:4] 183 168 177 167
## $ gender: chr [1:3] "M" "F" "M"
Doraemon.bag <- list(
char = "K",
num = c(2, 2, 6),
mat = matrix(6, 2, 2),
df = data.frame(A = 1:3, I = 4:6),
list = list(D = 1, M = 9)
)
str(Doraemon.bag)
## List of 5
## $ char: chr "K"
## $ num : num [1:3] 2 2 6
## $ mat : num [1:2, 1:2] 6 6 6 6
## $ df :'data.frame': 3 obs. of 2 variables:
## ..$ A: int [1:3] 1 2 3
## ..$ I: int [1:3] 4 5 6
## $ list:List of 2
## ..$ D: num 1
## ..$ M: num 9
L <- list(
name = c("john", "dave", "linda", "kino"),
height = c(183, 168, 177, 167),
gender = c("M", "F", "M"))
L$name
## [1] "john" "dave" "linda" "kino"
L[["name"]]
## [1] "john" "dave" "linda" "kino"
L["name"]
## $name
## [1] "john" "dave" "linda" "kino"
c(is.vector(L$name), is.vector(L[["name"]]),
is.list(L["name"]))
## [1] TRUE TRUE TRUE
f <- factor(c("yes", "yes", "no", "no", "yes"))
f
## [1] yes yes no no yes
## Levels: no yes
attributes(f)
## $levels
## [1] "no" "yes"
##
## $class
## [1] "factor"
f <- factor(
c("yes", "yes", "no", "no", "yes"),
levels = c("yes", "no"))
f
## [1] yes yes no no yes
## Levels: yes no
attr(f, "levels") <- c("y", "n")
unclass(f)
## [1] 1 1 2 2 1
## attr(,"levels")
## [1] "y" "n"
age <- sample(21:99, size = 50, replace = T)
age
## [1] 66 91 75 50 90 66 52 85 90 72 35 66 50 68 91 55 74 40 21 64 37 45 53
## [24] 96 26 66 24 57 27 64 40 29 99 60 85 97 32 33 76 53 65 30 63 80 27 54
## [47] 82 30 58 44
age_label <- cut(age, breaks = c(20,45,75,100), labels = c("young","middle","old"))
age_label
## [1] middle old middle middle old middle middle old old middle
## [11] young middle middle middle old middle middle young young middle
## [21] young young middle old young middle young middle young middle
## [31] young young old middle old old young young old middle
## [41] middle young middle old young middle old young middle young
## Levels: young middle old
# Get today's date
today <- Sys.Date()
# Printed in the ISO 8601 standard format yyyy-mm-dd
print(today)
## [1] "2018-07-10"
typeof(today)
## [1] "double"
class(today)
## [1] "Date"
# Convert a string to a Date
secondDay <- as.Date("2018-07-10")
print(secondDay)
## [1] "2018-07-10"
# What if the dates in our data look like 07|10|2018?
thirdDay <- c("07|11|2018")
class(thirdDay)
## [1] "character"
print(thirdDay)
## [1] "07|11|2018"
thirdDay <- as.Date(thirdDay, format = "%m|%d|%Y") # %Y: four-digit year
class(thirdDay)
## [1] "Date"
print(thirdDay)
## [1] "2018-07-11"
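Common conversion specifications for the `format` argument (note that a four-digit year such as 2018 needs `%Y`, not `%y`):
format specifier | Description |
---|---|
`%b` | Abbreviated month name, e.g. Jan |
`%B` | Full month name |
`%d` | Day of the month as two digits |
`%m` | Month as two digits |
`%y` | Year without century (two digits) |
`%Y` | Year with century (four digits) |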
today <- ISOdate(year = 2018,month = 07,day = 10)
print(today)
## [1] "2018-07-10 12:00:00 GMT"
class(today)
## [1] "POSIXct" "POSIXt"
today <- as.Date(today)
class(today)
## [1] "Date"
y <- 2015:2018
m <- 5:8
d <- 5:8
ISOdate(y,m,d)
## [1] "2015-05-05 12:00:00 GMT" "2016-06-06 12:00:00 GMT"
## [3] "2017-07-07 12:00:00 GMT" "2018-08-08 12:00:00 GMT"
# ISOdatetime extends this to hours, minutes, and seconds
# Dates are stored relative to an origin; R uses 1970-01-01
as.integer(today)
## [1] 17722
julian(today)
## [1] 17722
## attr(,"origin")
## [1] "1970-01-01"
as.integer(as.Date("1970-01-02"))
## [1] 1
# To extract part of a date, convert the Date object to a POSIXlt object, which is a list of the date's components.
today_list <- as.POSIXlt(today)
print(today_list)
## [1] "2018-07-10 UTC"
class(today_list)
## [1] "POSIXlt" "POSIXt"
today_list$mday
## [1] 10
today_list$mon # January = 0
## [1] 6
today_list$year
## [1] 118
today_list$year + 1900
## [1] 2018
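POSIXlt component | Description |
---|---|
`sec` | Seconds |
`min` | Minutes |
`hour` | Hours |
`mday` | Day of the month |
`mon` | Month (0 to 11, January = 0) |
`year` | Years since 1900 |
`wday` | Day of the week (0 to 6, Sunday = 0) |
`yday` | Day of the year (0 to 365) |
`isdst` | Daylight saving time flag |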
# Create a sequence of dates
f <- as.Date("2018-07-01")
t <- as.Date("2018-07-10")
seq(from = f, to = t, by = 1)
## [1] "2018-07-01" "2018-07-02" "2018-07-03" "2018-07-04" "2018-07-05"
## [6] "2018-07-06" "2018-07-07" "2018-07-08" "2018-07-09" "2018-07-10"
# Date arithmetic
# Subtraction
# Number of days between any two dates
begin <- as.Date("2016-03-04")
end <- as.Date("2016-05-08")
(during <- end - begin)
## Time difference of 65 days
# Number of weeks and hours between any two dates
difftime(end,begin,units = "weeks")
## Time difference of 9.285714 weeks
difftime(end,begin,units = "hours")
## Time difference of 1560 hours
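All of the functions above come with base R. Next we introduce `lubridate`, a package dedicated to handling date-time data efficiently. It is widely praised for speeding up data preprocessing. Its functions fall into two groups, one for instants and one for time spans. They are powerful yet intuitive: converting a string to a date-time needs no matching `format` specification, and extracting a component of a date-time takes a single small function with no extra arguments.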
## [1] "2009-01-01" "2009-01-02" "2009-01-03" "2009-01-04" "2009-01-05"
## [6] "2009-01-06" "2009-01-07"
## [1] 20
## [1] 6
## [1] 1
## [1] 30
q <- c(1, NaN, NA, 4)
class(q)
## [1] "numeric"
is.nan(q)
## [1] FALSE TRUE FALSE FALSE
is.na(q)
## [1] FALSE TRUE TRUE FALSE
mean(q)
## [1] NaN
mean(q, na.rm = T)
## [1] 2.5
is.na(NA)
is.na(NaN)
is.nan(NaN)
is.nan(NA)
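The difference between the two tests can be summarized in a short sketch: `is.na()` treats both `NA` and `NaN` as missing, while `is.nan()` is `TRUE` only for `NaN`. The expected values are noted in comments.

```r
is.na(NA)    # TRUE
is.na(NaN)   # TRUE: NaN also counts as missing
is.nan(NaN)  # TRUE
is.nan(NA)   # FALSE: NA is not NaN

# Counting and dropping missing values in a vector.
q <- c(1, NaN, NA, 4)
n_missing <- sum(is.na(q))
n_missing                       # 2
clean_mean <- mean(q[!is.na(q)])
clean_mean                      # 2.5, same as mean(q, na.rm = TRUE)
```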
data <- read.csv("./data/data.csv", header = TRUE)
data <- read.table("./data/data.txt", header = TRUE)
write.table(data, file = "./data/dataout.txt")
write.csv(data, file = "./data/dataout.csv")
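The read and write calls above depend on files under `./data/`. As a self-contained sketch of the same round trip, the block below writes a small hypothetical data frame to a temporary file and reads it back; the names `scores`, `path`, and `back` are made up for this example.

```r
# Hypothetical data, for illustration only.
scores <- data.frame(name = c("a", "b"), score = c(90, 85))

# Write to a temporary csv file; row.names = FALSE omits the row-name column,
# which would otherwise come back as an extra column when the file is read.
path <- tempfile(fileext = ".csv")
write.csv(scores, file = path, row.names = FALSE)

# Read the file back into a data frame.
back <- read.csv(path, header = TRUE)
back

# Clean up the temporary file.
unlink(path)
```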
s <- 1:5
ifelse(s > 3, yes = "Good", no = "Fair")
## [1] "Fair" "Fair" "Fair" "Good" "Good"
a <- 1
if(a > 2){
b <- 1
}else{
b <- 0
}
b
## [1] 0
a <- -1
if(a > 2){
b <- 1
}else if(a < 0){
b <- 1
}else{
b <- 0
}
b
## [1] 1
for(i in 1:3){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
M <- matrix(1:2, 2, 3)
M
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
numeric(3)
## [1] 0 0 0
col.sum <- numeric(ncol(M))
for(j in 1:ncol(M)){
for(i in 1:nrow(M)){
col.sum[j] <- col.sum[j] + M[i, j]
}
}
col.sum
## [1] 3 3 3
M
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
col.sum
## [1] 3 3 3
i <- 0
while(i < 3){
print(i)
i <- i + 1
}
## [1] 0
## [1] 1
## [1] 2
i <- 0
repeat{
print(i)
if (i > 1) break else i <- i + 1
}
## [1] 0
## [1] 1
## [1] 2
for(i in 1:10){
if(i <= 7){
next
}
print(i)
i <- i + 1
}
## [1] 8
## [1] 9
## [1] 10
print(M)
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
apply(M, 2, sum)
## [1] 3 3 3
apply(M, 1, sum)
## [1] 3 6
L <- list(
a = matrix(1:3, 1),
b = data.frame(
char = c("A", "B"),
logic = c(T, F)),
c = list(
1:3,
"T",
matrix(1:6, 2, 3)))
L
## $a
## [,1] [,2] [,3]
## [1,] 1 2 3
##
## $b
## char logic
## 1 A TRUE
## 2 B FALSE
##
## $c
## $c[[1]]
## [1] 1 2 3
##
## $c[[2]]
## [1] "T"
##
## $c[[3]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
lapply(L, class)
## $a
## [1] "matrix"
##
## $b
## [1] "data.frame"
##
## $c
## [1] "list"
sapply(L, class)
## a b c
## "matrix" "data.frame" "list"
mapply(rep, 1:6, 6:1)
## [[1]]
## [1] 1 1 1 1 1 1
##
## [[2]]
## [1] 2 2 2 2 2
##
## [[3]]
## [1] 3 3 3 3
##
## [[4]]
## [1] 4 4 4
##
## [[5]]
## [1] 5 5
##
## [[6]]
## [1] 6
set.seed(6)
n <- 9
bike <- data.frame(
orgn = sample(c("SE", "S", "N"), n, replace = T),
dest = sample(c("SE", "S", "N"), n, replace = T),
flux = round(rnorm(n, 18)))
bike
## orgn dest flux
## 1 S SE 17
## 2 N S 20
## 3 SE N 17
## 4 S SE 19
## 5 N SE 18
## 6 N N 17
## 7 N SE 18
## 8 N S 20
## 9 S N 17
tapply(X = bike$flux, INDEX = bike$orgn, FUN = sum)
## N S SE
## 93 53 17
##
## N S SE
## N 1 2 2
## S 1 0 2
## SE 1 0 0
## dest
## orgn N S SE
## N 17 20 18
## S 17 NA 18
## SE 17 NA NA
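Command | Description |
---|---|
`apply(X, MARGIN, FUN)` | Apply `FUN` over the given `MARGIN` of an array or matrix `X`, returning a vector, array, or list |
`lapply(X, FUN)` | Apply `FUN` to each element of a list `X`, returning a list |
`sapply(X, FUN)` | A simplifying version of `lapply` that returns a vector or array instead of a list |
`mapply(FUN, ...)` | A multivariate version of `lapply` |
`tapply(X, INDEX, FUN)` | Apply `FUN` to each group defined by the factor combinations in `INDEX`, returning a table |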
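As a small sketch of how `tapply` groups by one or by two factors, the block below uses a tiny hand-written data frame so the results are easy to check. The data frame `trips` and its values are hypothetical and unrelated to the random `bike` data above.

```r
# Hypothetical trip counts, for illustration only.
trips <- data.frame(
  orgn = c("N", "N", "S", "S", "S"),
  dest = c("S", "S", "N", "S", "N"),
  flux = c(10, 20, 5, 7, 15))

# One grouping factor: total flux per origin.
by_origin <- tapply(trips$flux, trips$orgn, sum)
by_origin

# Two grouping factors: mean flux per origin/destination pair.
# Pairs with no trips come back as NA.
by_pair <- tapply(trips$flux, list(orgn = trips$orgn, dest = trips$dest), mean)
by_pair

# Number of trips per origin/destination pair.
table(trips$orgn, trips$dest)
```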
rsummary <- function(x) {
return(
list(
mean = mean(x),
sd = sd(x),
quantiles = quantile(x)))
}
x <- rnorm(100, mean = 6, sd = 1)
rsummary(x)
## $mean
## [1] 5.928524
##
## $sd
## [1] 0.990941
##
## $quantiles
## 0% 25% 50% 75% 100%
## 4.047651 5.268584 5.818003 6.522913 8.608098
calGrade <- function(x, base = 59) {
s <- x + base
if(s >= 90) {
grade <- "A"
} else if(s >= 80) {
grade <- "B"
} else if(s >= 70) {
grade <- "C"
} else if(s >= 60) {
grade <- "D"
}else{
grade <- "Failed"
}
grade
}
scores <- c(40, 20, 27, 31, 0, 0)
calGrade(scores)
## [1] "A"
lapply(scores, calGrade)
## [[1]]
## [1] "A"
##
## [[2]]
## [1] "C"
##
## [[3]]
## [1] "B"
##
## [[4]]
## [1] "A"
##
## [[5]]
## [1] "Failed"
##
## [[6]]
## [1] "Failed"
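`calGrade` is written for a single score, because `if` expects a condition of length one; recent R versions raise an error when it receives a longer vector. Besides `lapply`, one alternative is a vectorized variant built on `cut`. This is a sketch under stated assumptions: `calGradeVec` is a hypothetical name, and it assumes the "D" band is meant to apply to the adjusted score `s`.

```r
# Hypothetical vectorized variant of calGrade (not from the original text).
calGradeVec <- function(x, base = 59) {
  s <- x + base
  # right = FALSE makes each interval closed on the left, e.g. [90, Inf) -> "A".
  grade <- cut(s,
               breaks = c(-Inf, 60, 70, 80, 90, Inf),
               labels = c("Failed", "D", "C", "B", "A"),
               right = FALSE)
  as.character(grade)
}

scores <- c(40, 20, 27, 31, 0, 0)
grades <- calGradeVec(scores)
grades
```

The result is a plain character vector with one grade per score, so it can be stored directly as a data frame column.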
square <- function(x) {
return(x^2)
}
square(11)
## [1] 121
square <- function(x) {
x^2
}
square(11)
## [1] 121
formals
formals(calGrade)
## $x
##
##
## $base
## [1] 59
formals(plot)
## $x
##
##
## $y
##
##
## $...
Arguments may be missing and may have default values
Argument matching
(M <- matrix(1:6, nrow = 2)) # exact matching
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
(M <- matrix(1:6, nr = 2)) # partial matching
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
(M <- matrix(1:6, 2)) # positional matching
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
quantile(x, probs = seq(0, 1, 0.25), ...)
apply(X, MARGIN, FUN, ...)
set.seed(1)
(x <- matrix(rnorm(12), 3))
## [,1] [,2] [,3] [,4]
## [1,] -0.6264538 1.5952808 0.4874291 -0.3053884
## [2,] 0.1836433 0.3295078 0.7383247 1.5117812
## [3,] -0.8356286 -0.8204684 0.5757814 0.3898432
apply(x, 2, quantile) # matrix
## [,1] [,2] [,3] [,4]
## 0% -0.8356286 -0.8204684 0.4874291 -0.30538839
## 25% -0.7310412 -0.2454803 0.5316052 0.04222742
## 50% -0.6264538 0.3295078 0.5757814 0.38984324
## 75% -0.2214052 0.9623943 0.6570530 0.95081220
## 100% 0.1836433 1.5952808 0.7383247 1.51178117
apply(x, 2, quantile, probs = c(.25, .75))
## [,1] [,2] [,3] [,4]
## 25% -0.7310412 -0.2454803 0.5316052 0.04222742
## 75% -0.2214052 0.9623943 0.6570530 0.95081220
str(paste)
## function (..., sep = " ", collapse = NULL)
paste("Hi", " There", sep = "!")
## [1] "Hi! There"
paste("Hi", " There", se = "!")
## [1] "Hi  There !"
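The two `paste` calls differ because arguments that come after `...` in a function's signature must be named in full; an abbreviated name such as `se` is swallowed by `...` and pasted as one more string. A short sketch contrasting this with `matrix`, whose `nrow` argument can be partially matched; the variable names are introduced only for this example.

```r
# `sep` comes after `...` in paste(), so it must be spelled out in full.
full <- paste("Hi", " There", sep = "!")
full

# `se` does not match `sep`; it is treated as a third string to paste,
# joined with the default separator (a single space).
partial <- paste("Hi", " There", se = "!")
partial

# In matrix(), `nr` is enough to identify `nrow` by partial matching.
m <- matrix(1:6, nr = 2)
dim(m)
```

Spelling argument names out in full avoids this kind of silent surprise.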
x <- 1:30
plot(x, log(x))
my.plot <-
function(x, y, type = "l", ...){
plot(x, y, type = type, ...)
}
my.plot(x, log(x))
source("./fn/MyPlotEncap.R")
my.plot.encap(x, log(x))
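Part 1: `swirl` is a package designed for practicing R after class; you learn R step by step by following its prompts. Some of this course's homework is completed with this package. For details, see the package's introduction page and its source code repository. Follow the steps below to complete this assignment.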
knitr::include_graphics('fig/swirl.png')
First, install `swirl`:
install.packages("swirl")
Next, start `swirl`:
# Load swirl
library("swirl")
# Start swirl
swirl()
After you enter `swirl()`, a prompt appears.
Enter your name, and you will reach the course selection. For the day-two homework, please complete lessons 1-5.
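Part 2:
Watch the introductory R videos in the 狗熊会 video micro-course and learn how to import data into R.
Download the Speed Dating data, together with its description and the accompanying article.
Create an R Markdown document.
Read in the Speed Dating data.
Convert `gender` to a factor.
Compute the ratio of men to women in the experiment.
Compute the mean of men's pre-date ratings for each attribute they look for in a partner (`attr1_1`, `sinc1_1`, `intel1_1`, `fun1_1`, `amb1_1`, `shar1_1`).
Compute the mean of women's pre-date ratings for each attribute they look for in a partner (`attr1_1`, `sinc1_1`, `intel1_1`, `fun1_1`, `amb1_1`, `shar1_1`).
What conclusions can you draw?
What other questions could be studied?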