{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 数据管理\n",
"\n",
"本节内容可应用在数据读取之后。包括基本的运算(包括统计函数)、数据重整(排序、合并、子集、随机抽样、整合、重塑等)、字符串处理、异常值(NA/Inf/NaN)处理等内容。也包括 apply() 这种函数式编程函数的使用。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数学函数\n",
"\n",
"数学运算符和一些统计学上需要的函数。\n",
"\n",
"### 数学运算符\n",
"\n",
"| 四则 | 幂运算 | 求余 | 整除 |\n",
"| --- | --- | --- | --- |\n",
"| +, -, \\*, / | ^ 或 \\*\\* | %% | %/% |\n",
"\n",
"例子:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 8 1 2\n"
]
}
],
"source": [
"a <- 2 ^ 3\n",
"b <- 5 %% 2\n",
"c <- 5 %/% 2\n",
"print(c(a, b, c))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基本数学函数\n",
"\n",
"- 绝对值:abs()\n",
"- 平方根:sqrt()\n",
"- 三角函数:sin(), cos(), tan(), acos(), asin(), atan()\n",
"- 对数:\n",
" - log(x, base=n) 以 n 为底 x 的对数\n",
" - log10(x) 以 10 为底的对数\n",
"- 指数:exp()\n",
"- 取整:\n",
" - 向上取整 ceiling()\n",
" - 向下取整 floor()\n",
" - 舍尾取整(绝对值减小) trunc()\n",
" - 四舍五入到第 N 位 round(x, digits=N)\n",
" - 四舍五入为有效数字共 N 位 singif(x, digits=N)\n",
"\n",
"### 统计、概率与随机数\n",
"\n",
"描述性统计等更多的统计内容,参考 [“描述性统计”一文](DescriptiveStatistics.ipynb)。\n",
"\n",
"#### 统计函数\n",
"\n",
"常用的统计函数:\n",
"\n",
"- 均值:mean()\n",
"- 中位数:median()\n",
"- 标准差:sd()\n",
"- 方差:var()\n",
"- 绝对中位差:mad(x, center=median(x), constant=1.4826, ...),计算式:\n",
"\n",
"$$ \\mathrm{mad}(x) = constant * \\mathrm{Median}(|x - center|)$$\n",
"\n",
"- 分位数:quantile(x, probs),例如 quantile(x, c(.3, 84%)) 返回 x 的 30% 和 84% 分位数。\n",
"- 极值:min() & max()\n",
"- 值域与极差:range(x),例如 range(c(1, 2, 3)) 结果为 c(1, 3)。极差用 diff(range(x))\n",
"- 差分:diff(x, lag=1)。可以用 lag 指定滞后项的个数,默认 1\n",
"- 标准化:scale(x, center=TRUE, scale=TRUE)。可以使用 scale(x) * SD + C 来获得标准差为 SD、均值为 C 的标准化结果。"
]
},
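{
"cell_type": "markdown",
"metadata": {},
"source": [
"The rounding and rescaling functions listed above are easy to confuse, so here is a minimal sketch that puts them side by side. The vector `x`, the targets `SD` and `C`, and the result name `z` are illustrative assumptions, not part of the original text.\n",
"\n",
"```r\n",
"x <- c(-2.567, 2.567)\n",
"ceiling(x)            # rounds up, toward +Inf\n",
"floor(x)              # rounds down, toward -Inf\n",
"trunc(x)              # drops the fractional part, toward 0\n",
"round(x, digits=1)    # one decimal place\n",
"signif(x, digits=2)   # two significant digits\n",
"\n",
"# Rescale to a chosen mean C and standard deviation SD (hypothetical targets)\n",
"SD <- 10\n",
"C <- 50\n",
"z <- as.vector(scale(c(1, 2, 3, 4, 5))) * SD + C\n",
"c(mean(z), sd(z))\n",
"```\n",
"\n",
"Because scale() returns a one-column matrix, as.vector() is used here to get a plain numeric vector back."
]
},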
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 概率函数\n",
"\n",
"常用的概率分布函数:\n",
"\n",
"- 正态分布:norm\n",
"- 泊松分布:pois\n",
"- 均匀分布:unif\n",
"- Beta 分布:beta\n",
"- 二项分布:binom\n",
"- 柯西分布:cauchy\n",
"- 卡方分布:chisq\n",
"- 指数分布:exp\n",
"- F 分布:f\n",
"- t 分布:t\n",
"- Gamma 分布:gamma\n",
"- 几何分布:geom\n",
"- 超几何分布:hyper\n",
"- 对数正态分布:lnorm\n",
"- Logistic 分布:logis\n",
"- 多项分布:multinom\n",
"- 负二项分布:nbinom\n",
"\n",
"以上各概率函数的缩写记为 *abbr*, 那么对应的概率函数有:\n",
"\n",
"1. **密度函数**: d{abbr}(),例如对于正态就是 dnorm()\n",
"2. **分布函数**:p{abbr}()\n",
"3. **分位数函数**:q{abbr}()\n",
"4. **生成随机数**:r{abbr}(),例如常用的 runif() 生成均匀分布\n",
"\n",
"#### 例子\n",
"\n",
"通过 runif() 产生 $[0, 1]$ 上的服从均匀分布的伪随机数列。通过 set.seed() 可以指定随机数种子,使得代码可以重现。不过**作用域只有跟随其后的那个随机数函数。**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 0.2875775 0.7883051 0.4089769\n"
]
}
],
"source": [
"set.seed(123)\n",
"print(runif(3))"
]
},
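{
"cell_type": "markdown",
"metadata": {},
"source": [
"How set.seed() interacts with later random calls can be checked directly. This is a small sketch, with the variable names `first`, `second` and `again` chosen here for illustration: the seed fixes the generator state once, later calls continue from that state, and resetting the seed reproduces the first draw.\n",
"\n",
"```r\n",
"set.seed(123)\n",
"first <- runif(3)\n",
"second <- runif(3)    # continues the same stream, so it differs from first\n",
"\n",
"set.seed(123)         # reset the generator state\n",
"again <- runif(3)\n",
"\n",
"identical(first, again)\n",
"identical(first, second)\n",
"```\n",
"\n",
"The first comparison should be TRUE and the second FALSE, which is why a seed is usually set immediately before each draw that has to be repeatable."
]
},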
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"0.97500210485178"
],
"text/latex": [
"0.97500210485178"
],
"text/markdown": [
"0.97500210485178"
],
"text/plain": [
"[1] 0.9750021"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 位于 1.96 左侧的标准正态分布曲线下方的面积\n",
"pnorm(1.96)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"628.15515655446"
],
"text/latex": [
"628.15515655446"
],
"text/markdown": [
"628.15515655446"
],
"text/plain": [
"[1] 628.1552"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 均值为500,标准差为100 的正态分布的0.9 分位点\n",
"qnorm(.9, mean=500, sd=100)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 44.39524 47.69823 65.58708\n"
]
}
],
"source": [
"# 生成 3 个均值为50,标准差为10 的正态随机数\n",
"set.seed(123)\n",
"print(rnorm(3, mean=50, sd=10))"
]
},
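{
"cell_type": "markdown",
"metadata": {},
"source": [
"The p and q functions are inverses of each other, and the d function gives the height of the density curve. A brief sketch for the standard normal distribution (the names `p`, `q` and `peak` are illustrative):\n",
"\n",
"```r\n",
"p <- pnorm(1.96)      # probability to the left of 1.96\n",
"q <- qnorm(p)         # maps that probability back to the quantile\n",
"peak <- dnorm(0)      # density at the mean, equal to 1 / sqrt(2 * pi)\n",
"\n",
"all.equal(q, 1.96)\n",
"all.equal(peak, 1 / sqrt(2 * pi))\n",
"```\n",
"\n",
"all.equal() is used instead of == because the round trip goes through floating-point arithmetic."
]
},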
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据框操作\n",
"\n",
"数据框是最常使用的数据类型。下面给出数据框使用中一些实用的场景,以及解决方案。\n",
"\n",
"### 行、列操作\n",
"\n",
"#### 新建\n",
"\n",
"创建一个新的列(变量)是很常见的操作。比如我们现在有数据框 df ,想要在右侧新建一个列,使其等于左侧两列的和。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"x1 | x2 | sumx |
\n",
"\n",
"\t1 | 2 | 3 |
\n",
"\t3 | 4 | 7 |
\n",
"\t5 | 6 | 11 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" x1 & x2 & sumx\\\\\n",
"\\hline\n",
"\t 1 & 2 & 3\\\\\n",
"\t 3 & 4 & 7\\\\\n",
"\t 5 & 6 & 11\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | sumx | \n",
"|---|---|---|\n",
"| 1 | 2 | 3 | \n",
"| 3 | 4 | 7 | \n",
"| 5 | 6 | 11 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2 sumx\n",
"1 1 2 3 \n",
"2 3 4 7 \n",
"3 5 6 11 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df = data.frame(x1=c(1, 3, 5), x2=c(2, 4, 6))\n",
"# 直接用美元符声明一个新列\n",
"df$sumx <- df$x1 + df$x2\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 | sumx | sumx2 |
\n",
"\n",
"\t1 | 2 | 3 | 3 |
\n",
"\t3 | 4 | 7 | 7 |
\n",
"\t5 | 6 | 11 | 11 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" x1 & x2 & sumx & sumx2\\\\\n",
"\\hline\n",
"\t 1 & 2 & 3 & 3\\\\\n",
"\t 3 & 4 & 7 & 7\\\\\n",
"\t 5 & 6 & 11 & 11\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | sumx | sumx2 | \n",
"|---|---|---|\n",
"| 1 | 2 | 3 | 3 | \n",
"| 3 | 4 | 7 | 7 | \n",
"| 5 | 6 | 11 | 11 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2 sumx sumx2\n",
"1 1 2 3 3 \n",
"2 3 4 7 7 \n",
"3 5 6 11 11 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 或者使用 transform 函数\n",
"df <- transform(df, sumx2=x1+x2)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 重命名"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"x1\" \"x2\" \"sumx\" \"SUM\" \n"
]
}
],
"source": [
"colnames(df)[4] <- \"SUM\"\n",
"print(colnames(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 选取/剔除: subset()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 |
\n",
"\n",
"\t1 | 2 |
\n",
"\t3 | 4 |
\n",
"\t5 | 6 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" x1 & x2\\\\\n",
"\\hline\n",
"\t 1 & 2\\\\\n",
"\t 3 & 4\\\\\n",
"\t 5 & 6\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | \n",
"|---|---|---|\n",
"| 1 | 2 | \n",
"| 3 | 4 | \n",
"| 5 | 6 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2\n",
"1 1 2 \n",
"2 3 4 \n",
"3 5 6 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 选取前两列\n",
"df[,1:2] # 或者 df[c(\"x1\", \"x2\")]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 | SUM |
\n",
"\n",
"\t1 | 2 | 3 |
\n",
"\t3 | 4 | 7 |
\n",
"\t5 | 6 | 11 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" x1 & x2 & SUM\\\\\n",
"\\hline\n",
"\t 1 & 2 & 3\\\\\n",
"\t 3 & 4 & 7\\\\\n",
"\t 5 & 6 & 11\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | SUM | \n",
"|---|---|---|\n",
"| 1 | 2 | 3 | \n",
"| 3 | 4 | 7 | \n",
"| 5 | 6 | 11 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2 SUM\n",
"1 1 2 3 \n",
"2 3 4 7 \n",
"3 5 6 11 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 剔除列 sumx\n",
"df <- df[!names(df) == \"sumx\"]\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 |
\n",
"\n",
"\t1 | 2 |
\n",
"\t3 | 4 |
\n",
"\t5 | 6 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" x1 & x2\\\\\n",
"\\hline\n",
"\t 1 & 2\\\\\n",
"\t 3 & 4\\\\\n",
"\t 5 & 6\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | \n",
"|---|---|---|\n",
"| 1 | 2 | \n",
"| 3 | 4 | \n",
"| 5 | 6 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2\n",
"1 1 2 \n",
"2 3 4 \n",
"3 5 6 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 剔除第三列\n",
"df <- df[-c(3)] # 或者 df[c(-3)]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"至于选取行,与列的操作方式是类似的:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" | x1 | x2 |
\n",
"\n",
"\t2 | 3 | 4 |
\n",
"\t3 | 5 | 6 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" & x1 & x2\\\\\n",
"\\hline\n",
"\t2 & 3 & 4\\\\\n",
"\t3 & 5 & 6\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | x1 | x2 | \n",
"|---|---|\n",
"| 2 | 3 | 4 | \n",
"| 3 | 5 | 6 | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2\n",
"2 3 4 \n",
"3 5 6 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 选取 x1>2 且 x2为偶数的观测(行)\n",
"df[df$x1 > 2 & df$x2 %% 2 ==0,]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"再介绍一个 subset() 指令,非常简单粗暴。先来一个复杂点的数据集:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"age | gender | q1 | q2 | q3 |
\n",
"\n",
"\t22 | Male | 1 | 4 | 3 |
\n",
"\t37 | Female | 5 | 4 | 2 |
\n",
"\t28 | Male | 3 | 5 | 4 |
\n",
"\t33 | Female | 3 | 3 | 3 |
\n",
"\t43 | Male | 2 | 1 | 1 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllll}\n",
" age & gender & q1 & q2 & q3\\\\\n",
"\\hline\n",
"\t 22 & Male & 1 & 4 & 3 \\\\\n",
"\t 37 & Female & 5 & 4 & 2 \\\\\n",
"\t 28 & Male & 3 & 5 & 4 \\\\\n",
"\t 33 & Female & 3 & 3 & 3 \\\\\n",
"\t 43 & Male & 2 & 1 & 1 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"age | gender | q1 | q2 | q3 | \n",
"|---|---|---|---|---|\n",
"| 22 | Male | 1 | 4 | 3 | \n",
"| 37 | Female | 5 | 4 | 2 | \n",
"| 28 | Male | 3 | 5 | 4 | \n",
"| 33 | Female | 3 | 3 | 3 | \n",
"| 43 | Male | 2 | 1 | 1 | \n",
"\n",
"\n"
],
"text/plain": [
" age gender q1 q2 q3\n",
"1 22 Male 1 4 3 \n",
"2 37 Female 5 4 2 \n",
"3 28 Male 3 5 4 \n",
"4 33 Female 3 3 3 \n",
"5 43 Male 2 1 1 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"DF <- data.frame(age = c(22, 37, 28, 33, 43),\n",
" gender = c(1, 2, 1, 2, 1),\n",
" q1 = c(1, 5, 3, 3, 2),\n",
" q2 = c(4, 4, 5, 3, 1),\n",
" q3 = c(3, 2, 4, 3, 1))\n",
"DF$gender <- factor(DF$gender, labels=c(\"Male\", \"Female\"))\n",
"\n",
"DF"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" | age | gender | q1 | q2 |
\n",
"\n",
"\t2 | 37 | Female | 5 | 4 |
\n",
"\t3 | 28 | Male | 3 | 5 |
\n",
"\t4 | 33 | Female | 3 | 3 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" & age & gender & q1 & q2\\\\\n",
"\\hline\n",
"\t2 & 37 & Female & 5 & 4 \\\\\n",
"\t3 & 28 & Male & 3 & 5 \\\\\n",
"\t4 & 33 & Female & 3 & 3 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | age | gender | q1 | q2 | \n",
"|---|---|---|\n",
"| 2 | 37 | Female | 5 | 4 | \n",
"| 3 | 28 | Male | 3 | 5 | \n",
"| 4 | 33 | Female | 3 | 3 | \n",
"\n",
"\n"
],
"text/plain": [
" age gender q1 q2\n",
"2 37 Female 5 4 \n",
"3 28 Male 3 5 \n",
"4 33 Female 3 3 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 选中年龄介于 25 与 40 之间的观测\n",
"# 并只保留变量 age 到 q2\n",
"subset(DF, age > 25 & age < 40, select=age:q2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 横向合并\n",
"\n",
"如果你有两个**行数相同**的数据框,你可以使用 merge() 将其进行内联合并(inner join),他们将通过一个或多个共有的变量进行合并。"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"ID | Sym | Oprtr.x | Oprtr.y |
\n",
"\n",
"\t1 | A | x | x |
\n",
"\t2 | B | y | z |
\n",
"\t3 | C | z | y |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" ID & Sym & Oprtr.x & Oprtr.y\\\\\n",
"\\hline\n",
"\t 1 & A & x & x\\\\\n",
"\t 2 & B & y & z\\\\\n",
"\t 3 & C & z & y\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"ID | Sym | Oprtr.x | Oprtr.y | \n",
"|---|---|---|\n",
"| 1 | A | x | x | \n",
"| 2 | B | y | z | \n",
"| 3 | C | z | y | \n",
"\n",
"\n"
],
"text/plain": [
" ID Sym Oprtr.x Oprtr.y\n",
"1 1 A x x \n",
"2 2 B y z \n",
"3 3 C z y "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df1 <- data.frame(ID=c(1, 2, 3), Sym=c(\"A\", \"B\", \"C\"), Oprtr=c(\"x\", \"y\", \"z\"))\n",
"df2 <- data.frame(ID=c(1, 3, 2), Oprtr=c(\"x\", \"y\", \"z\"))\n",
"\n",
"# 按 ID 列合并\n",
"merge(df1, df2, by=\"ID\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"ID | Oprtr | Sym |
\n",
"\n",
"\t1 | x | A |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" ID & Oprtr & Sym\\\\\n",
"\\hline\n",
"\t 1 & x & A\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"ID | Oprtr | Sym | \n",
"|---|\n",
"| 1 | x | A | \n",
"\n",
"\n"
],
"text/plain": [
" ID Oprtr Sym\n",
"1 1 x A "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 由于 ID 与 Oprtr 一致的只有一行,因此其余的都舍弃\n",
"merge(df1, df2, by=c(\"ID\", \"Oprtr\"))"
]
},
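{
"cell_type": "markdown",
"metadata": {},
"source": [
"merge() also covers the other join types through its all, all.x and all.y arguments, and the inputs may have different numbers of rows. A minimal sketch, using two small hypothetical data frames `left` and `right`:\n",
"\n",
"```r\n",
"left <- data.frame(ID=c(1, 2, 3), Sym=c(\"A\", \"B\", \"C\"))\n",
"right <- data.frame(ID=c(2, 3, 4, 5), Val=c(20, 30, 40, 50))\n",
"\n",
"inner <- merge(left, right, by=\"ID\")                  # IDs 2 and 3 only\n",
"left_join <- merge(left, right, by=\"ID\", all.x=TRUE)  # every row of left; Val is NA for ID 1\n",
"full <- merge(left, right, by=\"ID\", all=TRUE)         # every ID from either side\n",
"\n",
"c(nrow(inner), nrow(left_join), nrow(full))\n",
"```\n",
"\n",
"Rows without a partner are filled with NA rather than dropped, which makes the unmatched keys easy to spot afterwards."
]
},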
{
"cell_type": "markdown",
"metadata": {},
"source": [
"或者直接用 cbind() 函数组合。"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"ID | Sym | Oprtr | ID | Oprtr |
\n",
"\n",
"\t1 | A | x | 1 | x |
\n",
"\t2 | B | y | 3 | y |
\n",
"\t3 | C | z | 2 | z |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lllll}\n",
" ID & Sym & Oprtr & ID & Oprtr\\\\\n",
"\\hline\n",
"\t 1 & A & x & 1 & x\\\\\n",
"\t 2 & B & y & 3 & y\\\\\n",
"\t 3 & C & z & 2 & z\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"ID | Sym | Oprtr | ID | Oprtr | \n",
"|---|---|---|\n",
"| 1 | A | x | 1 | x | \n",
"| 2 | B | y | 3 | y | \n",
"| 3 | C | z | 2 | z | \n",
"\n",
"\n"
],
"text/plain": [
" ID Sym Oprtr ID Oprtr\n",
"1 1 A x 1 x \n",
"2 2 B y 3 y \n",
"3 3 C z 2 z "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 直接组合。注意:列名相同的话,在按列名调用时右侧的会被忽略\n",
"cbind(df1, df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 纵向合并\n",
"\n",
"相当于追加观测。两个数据框必须有**相同的变量**,尽管顺序可以不同。如果两个数据框变量不同请:\n",
"\n",
"- 删除多余变量;\n",
"- 在缺少变量的数据框中,追加同名变量并将其设为缺失值 NA。"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"ID | Sym | Oprtr |
\n",
"\n",
"\t1 | A | x |
\n",
"\t2 | B | y |
\n",
"\t3 | C | z |
\n",
"\t1 | NA | x |
\n",
"\t3 | NA | y |
\n",
"\t2 | NA | z |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" ID & Sym & Oprtr\\\\\n",
"\\hline\n",
"\t 1 & A & x \\\\\n",
"\t 2 & B & y \\\\\n",
"\t 3 & C & z \\\\\n",
"\t 1 & NA & x \\\\\n",
"\t 3 & NA & y \\\\\n",
"\t 2 & NA & z \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"ID | Sym | Oprtr | \n",
"|---|---|---|---|---|---|\n",
"| 1 | A | x | \n",
"| 2 | B | y | \n",
"| 3 | C | z | \n",
"| 1 | NA | x | \n",
"| 3 | NA | y | \n",
"| 2 | NA | z | \n",
"\n",
"\n"
],
"text/plain": [
" ID Sym Oprtr\n",
"1 1 A x \n",
"2 2 B y \n",
"3 3 C z \n",
"4 1 NA x \n",
"5 3 NA y \n",
"6 2 NA z "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df1 <- data.frame(ID=c(1, 2, 3), Sym=c(\"A\", \"B\", \"C\"), Oprtr=c(\"x\", \"y\", \"z\"))\n",
"df2 <- data.frame(ID=c(1, 3, 2), Oprtr=c(\"x\", \"y\", \"z\"))\n",
"df2$Sym <- NA\n",
"\n",
"rbind(df1, df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 逻辑型筛选\n",
"\n",
"通过逻辑判断来过滤数据,或者选取数据子集,或者将子集作统一更改。在前面的一些例子中已经使用到了。"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 | x3 |
\n",
"\n",
"\t1 | 2 | NA |
\n",
"\t3 | 4 | 8 |
\n",
"\t5 | 6 | NA |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" x1 & x2 & x3\\\\\n",
"\\hline\n",
"\t 1 & 2 & NA\\\\\n",
"\t 3 & 4 & 8\\\\\n",
"\t 5 & 6 & NA\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | x3 | \n",
"|---|---|---|\n",
"| 1 | 2 | NA | \n",
"| 3 | 4 | 8 | \n",
"| 5 | 6 | NA | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2 x3\n",
"1 1 2 NA\n",
"2 3 4 8\n",
"3 5 6 NA"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df$x3 <- c(7, 8, 9)\n",
"# 把列 x3 中的奇数换成 NA\n",
"df$x3[df$x3 %% 2 == 1] <- NA\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x1 | x2 | x3 | y |
\n",
"\n",
"\tNaN | NaN | NA | 7 |
\n",
"\t 3 | 4 | 8 | -Inf |
\n",
"\t 5 | 6 | NA | Inf |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" x1 & x2 & x3 & y\\\\\n",
"\\hline\n",
"\t NaN & NaN & NA & 7\\\\\n",
"\t 3 & 4 & 8 & -Inf\\\\\n",
"\t 5 & 6 & NA & Inf\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x1 | x2 | x3 | y | \n",
"|---|---|---|\n",
"| NaN | NaN | NA | 7 | \n",
"| 3 | 4 | 8 | -Inf | \n",
"| 5 | 6 | NA | Inf | \n",
"\n",
"\n"
],
"text/plain": [
" x1 x2 x3 y \n",
"1 NaN NaN NA 7\n",
"2 3 4 8 -Inf\n",
"3 5 6 NA Inf"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df$y <- c(7, 12, 27)\n",
"# 把所有小于 3 的标记为 NaN\n",
"# 把所有大于 10 的数按奇偶标记为正负Inf\n",
"\n",
"df[df < 3] <- NaN\n",
"df[df > 10 & df %% 2 == 1] <- Inf\n",
"df[df > 10 & df %% 2 == 0] <- -Inf\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 排序\n",
"\n",
"排序使用 order() 命令。"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" | age | gender |
\n",
"\n",
"\t5 | 43 | Male |
\n",
"\t3 | 28 | Male |
\n",
"\t1 | 22 | Male |
\n",
"\t2 | 37 | Female |
\n",
"\t4 | 33 | Female |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" & age & gender\\\\\n",
"\\hline\n",
"\t5 & 43 & Male \\\\\n",
"\t3 & 28 & Male \\\\\n",
"\t1 & 22 & Male \\\\\n",
"\t2 & 37 & Female\\\\\n",
"\t4 & 33 & Female\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | age | gender | \n",
"|---|---|---|---|---|\n",
"| 5 | 43 | Male | \n",
"| 3 | 28 | Male | \n",
"| 1 | 22 | Male | \n",
"| 2 | 37 | Female | \n",
"| 4 | 33 | Female | \n",
"\n",
"\n"
],
"text/plain": [
" age gender\n",
"5 43 Male \n",
"3 28 Male \n",
"1 22 Male \n",
"2 37 Female\n",
"4 33 Female"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df <- data.frame(age =c(22, 37, 28, 33, 43),\n",
" gender=c(1, 2, 1, 2, 1))\n",
"df$gender <- factor(df$gender, labels=c(\"Male\", \"Female\"))\n",
"\n",
"# 按gender升序排序,各gender内按age降序排序\n",
"df[order(df$gender, -df$age),]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 随机抽样\n",
"\n",
"从已有的数据集中随机抽选样本是常见的做法。例如,其中一份用于构建预测模型,另一份用于验证模型。\n",
"\n",
"```r\n",
"# 无放回地从 df 的所有观测中,抽取一个大小为 3 的样本\n",
"df[sample(1:nrow(df), 3, replace=F)]\n",
"```\n",
"\n",
"随机抽样的 R 包有 sampling 与 survey,如果可能我会在本系列下另建文章介绍。\n",
"\n",
"### SQL语句\n",
"\n",
"在 R 中,借助 sqldf 包可以直接用 SQL 语句操作数据框(data.frame)。一个来自书中的例子:\n",
"\n",
"```r\n",
"newdf <- sqldf(\"select * from mtcars where carb=1 order by mpg\", row.names=TRUE)\n",
"```\n",
"\n",
"这里就不过多涉及了。"
]
},
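{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split into a model-building part and a validation part mentioned above can be sketched with sample(). The 70/30 ratio, the seed, and the names `train` and `test` are assumptions made for this illustration; mtcars is a data set shipped with R.\n",
"\n",
"```r\n",
"set.seed(42)                                   # for a reproducible split\n",
"n <- nrow(mtcars)\n",
"idx <- sample(n, size=floor(0.7 * n), replace=FALSE)\n",
"\n",
"train <- mtcars[idx, ]                         # rows drawn for model building\n",
"test <- mtcars[-idx, ]                         # the remaining rows, for validation\n",
"\n",
"c(nrow(train), nrow(test))\n",
"```\n",
"\n",
"Negative indices drop the sampled rows, so the two parts never overlap."
]
},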
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 字符串处理\n",
"\n",
"R 中的字符串处理函数有以下几种:\n",
"\n",
"### 通用函数\n",
"\n",
"| 函数 | 含义 |\n",
"| --- | --- |\n",
"| nchar(x) | 计算字符串的长度 |\n",
"| substr(x, start, stop) | 提取子字符串 |\n",
"| grep(pattern, x, ignore.case=FALSE, fixed=FALSE) | 正则搜索,返回为匹配的下标。如果 fixed=T,则按字符串而不是正则搜索。 |\n",
"| grepl() | 类似 grep(),只不过返回值是逻辑值向量。 |\n",
"| sub(pattern, replacement, x, ignore.base=FALSE, fixed=FALSE) | 在 x 中搜索正则式,并以 replacement 将其替换。如果 fixed=T,则按字符串而不是正则搜索 |\n",
"| strsplit(x, split, fixed=FALSE) | 在 split 处分割字符向量 x 中的元素,返回一个列表。 |\n",
"| paste(x1, x2, ..., sep=\"\") | 连接字符串,连接符为 sep。也可以连接重复字串:`paste(\"x\", 1:3, sep=\"\")` |\n",
"| toupper(x) | 转换字符串为全大写 |\n",
"| tolower(x) | 转换字符串为全小写 |\n",
"\n",
"一些例子。首先是正则表达式的使用:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1]]\n",
"[1] 1 2 3 4\n",
"\n",
"[[2]]\n",
"[1] 4\n",
"\n",
"[[3]]\n",
"[1] \"Hey\" \"Hey\" \"Hey\" \"Hey5\"\n",
"\n",
"[[4]]\n",
"[1] \"abc\" \"abcc\" \"abccc\" \"NEW\" \n",
"\n"
]
}
],
"source": [
"streg <- c(\"abc\", \"abcc\", \"abccc\", \"abc5\")\n",
"re1 <- grep(\"abc*\", streg)\n",
"re2 <- grep(\"abc\\\\d\", streg) # 注意反斜杠要双写来在 R 中转义\n",
"re3 <- sub(\"[a-z]*\", \"Hey\", streg)\n",
"re4 <- sub(\"[a-z]*\\\\d\", \"NEW\", streg)\n",
"\n",
"print(list(re1, re2, re3, re4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"然后是字符串分割与连接。注意这里的 paste() 有非常巧妙的用法:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1]]\n",
"[[1]][[1]]\n",
"[1] \"ab\"\n",
"\n",
"[[1]][[2]]\n",
"[1] \"ab\" \"\" \n",
"\n",
"[[1]][[3]]\n",
"[1] \"ab\" \"\" \"\" \n",
"\n",
"[[1]][[4]]\n",
"[1] \"ab\" \"5\" \n",
"\n",
"\n",
"[[2]]\n",
"[1] \"a-b-c\"\n",
"\n",
"[[3]]\n",
"[1] \"x1\" \"x2\" \"x3\"\n",
"\n"
]
}
],
"source": [
"splt <- strsplit(streg, \"c\") # 结果中不含分隔符 \"c\"\n",
"cat1 <- paste(\"a\", \"b\", \"c\", sep=\"-\")\n",
"cat2 <- paste(\"x\", 1:3, sep=\"\") # 生成列名时非常有用\n",
"\n",
"print(list(splt, cat1, cat2))"
]
},
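{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of a few functions from the table that the examples above do not exercise. It also contrasts sub() with gsub(), the base R function for replacing every match rather than only the first. The string `s` is made up for the illustration.\n",
"\n",
"```r\n",
"s <- \"banana\"\n",
"first_only <- sub(\"a\", \"A\", s)     # replaces the first match only\n",
"every <- gsub(\"a\", \"A\", s)         # replaces all matches\n",
"\n",
"has_digit <- grepl(\"\\\\d\", c(\"abc\", \"abc5\"))  # logical vector instead of indices\n",
"piece <- substr(s, 2, 4)           # characters 2 through 4\n",
"\n",
"print(list(first_only, every, has_digit, piece, nchar(s), toupper(s)))\n",
"```\n",
"\n",
"grepl() is usually the more convenient choice inside a logical filter, because its result has the same length as the input."
]
},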
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 日期型字符串\n",
"\n",
"与其他类型相似,日期型字符串能够通过 as.Date() 函数处理。各格式字符的含义如下:\n",
"\n",
"| 符号 | 含义 | 通用示例 | 中文示例 |\n",
"| --- | --- | --- | --- |\n",
"| %d | 日(1~31) | 22 | 22 |\n",
"| %a | 缩写星期 | Mon | 周一 |\n",
"| %A | 全写星期 | Monday | 星期一 |\n",
"| %m | 月(1~12) | 10 | 10 |\n",
"| %b | 缩写月 | Jan | 1月 |\n",
"| %B | 全写月 | January | 一月 |\n",
"| %y | 两位年 | 17 | 17 |\n",
"| %Y | 四位年 | 2017 | 2017 |"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] \"2017-01-28\"\n"
]
}
],
"source": [
"# 对字符串数据 x,用法:as.Date(x, format=, ...)\n",
"dates <- as.Date(\"01-28-2017\", format=\"%m-%d-%Y\")\n",
"print(dates)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"要想获得当前的日期或时间,有两种格式可以参考,并可以用 format() 函数辅助输出。"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"'星期六'"
],
"text/latex": [
"'星期六'"
],
"text/markdown": [
"'星期六'"
],
"text/plain": [
"[1] \"星期六\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Sys.Date() 返回一个精确到日的标准日期格式\n",
"dates1 <- Sys.Date()\n",
"format(dates1, format=\"%A\") # 可以指定输出格式"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"'Sat Apr 22 15:30:54 2017'"
],
"text/latex": [
"'Sat Apr 22 15:30:54 2017'"
],
"text/markdown": [
"'Sat Apr 22 15:30:54 2017'"
],
"text/plain": [
"[1] \"Sat Apr 22 15:30:54 2017\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# date() 返回一个精确到秒的详细的字串\n",
"dates2 <- date()\n",
"dates2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"函数 difftime() 提供了计算时间差的方式。其中计量单位可以是以下之一:\"auto\", \"secs\", \"mins\", \"hours\", \"days\", \"weeks\"。\n",
"\n",
"截至本文最后更新,我有 1100+ 周大。唔……这好像听起来没什么感觉"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Time difference of 1169.429 weeks"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dates1 <- as.Date(\"1994-11-23\")\n",
"dates2 <- Sys.Date()\n",
"difftime(dates2, dates1, units=\"weeks\")"
]
},
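{
"cell_type": "markdown",
"metadata": {},
"source": [
"as.Date() and format() convert between strings and Date objects, and Date objects support arithmetic in days. A minimal sketch using only the numeric format codes, so the result does not depend on the locale (the dates are arbitrary examples):\n",
"\n",
"```r\n",
"d <- as.Date(\"28/01/17\", format=\"%d/%m/%y\")   # two-digit year\n",
"iso <- format(d, format=\"%Y-%m-%d\")\n",
"later <- d + 30                                   # Date arithmetic works in days\n",
"gap <- as.numeric(difftime(later, d, units=\"weeks\"))\n",
"\n",
"print(list(iso, format(later, \"%Y-%m-%d\"), gap))\n",
"```\n",
"\n",
"Wrapping difftime() in as.numeric() strips the units label when only the number is needed."
]
},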
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 异常值处理\n",
"\n",
"异常值包括三类:\n",
"\n",
"- NA:缺失值。\n",
"- Inf:正无穷。用 -Inf 表示负无穷。**无穷与数可以比较大小,**比如 -Inf < 3 为真。\n",
"- NaN:非可能值。比如 0/0。\n",
"\n",
"使用 is.na() 函数判断数据集中是否存在 NA 或者 NaN,并返回矩阵。注意 NaN 会被判断为缺失值。"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"age | gender |
\n",
"\n",
"\tFALSE | FALSE |
\n",
"\tFALSE | FALSE |
\n",
"\tFALSE | FALSE |
\n",
"\tFALSE | FALSE |
\n",
"\tFALSE | FALSE |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{ll}\n",
" age & gender\\\\\n",
"\\hline\n",
"\t FALSE & FALSE\\\\\n",
"\t FALSE & FALSE\\\\\n",
"\t FALSE & FALSE\\\\\n",
"\t FALSE & FALSE\\\\\n",
"\t FALSE & FALSE\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"age | gender | \n",
"|---|---|---|---|---|\n",
"| FALSE | FALSE | \n",
"| FALSE | FALSE | \n",
"| FALSE | FALSE | \n",
"| FALSE | FALSE | \n",
"| FALSE | FALSE | \n",
"\n",
"\n"
],
"text/plain": [
" age gender\n",
"[1,] FALSE FALSE \n",
"[2,] FALSE FALSE \n",
"[3,] FALSE FALSE \n",
"[4,] FALSE FALSE \n",
"[5,] FALSE FALSE "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"is.na(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另外也有类似的函数来判断 Inf 与 NaN,但只能对一维数据集使用:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] TRUE TRUE FALSE\n"
]
}
],
"source": [
"print(c(is.infinite(c(Inf, -Inf)), is.nan(NA)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在进行数据处理之前,处理 NA 缺失值是必须的步骤。如果某些数值过于离群,你也可能需要将其标记为 NA 。行移除是最简单粗暴的处理方法。"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"age | gender |
\n",
"\n",
"\t22 | Male |
\n",
"\t37 | Female |
\n",
"\t28 | Male |
\n",
"\t33 | Female |
\n",
"\t43 | Male |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" age & gender\\\\\n",
"\\hline\n",
"\t 22 & Male \\\\\n",
"\t 37 & Female\\\\\n",
"\t 28 & Male \\\\\n",
"\t 33 & Female\\\\\n",
"\t 43 & Male \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"age | gender | \n",
"|---|---|---|---|---|\n",
"| 22 | Male | \n",
"| 37 | Female | \n",
"| 28 | Male | \n",
"| 33 | Female | \n",
"| 43 | Male | \n",
"\n",
"\n"
],
"text/plain": [
" age gender\n",
"1 22 Male \n",
"2 37 Female\n",
"3 28 Male \n",
"4 33 Female\n",
"5 43 Male "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# NA 行移除\n",
"df <- na.omit(df)\n",
"df"
]
},
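{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data frame above happens to contain no missing values, so here is a sketch with a small made-up data frame `d` that does. It shows the difference between is.na() and is.nan(), one way to test for Inf column by column, and what na.omit() removes.\n",
"\n",
"```r\n",
"d <- data.frame(a=c(1, NA, 3, NaN), b=c(Inf, 2, 3, 4))\n",
"\n",
"is.na(d$a)                       # TRUE for both NA and NaN\n",
"is.nan(d$a)                      # TRUE for NaN only\n",
"inf_cols <- sapply(d, function(col) any(is.infinite(col)))  # per-column check\n",
"\n",
"clean <- na.omit(d)              # drops rows 2 and 4; Inf is not treated as missing\n",
"mean_a <- mean(d$a, na.rm=TRUE)  # or skip missing values inside the computation\n",
"\n",
"print(list(inf_cols, clean, mean_a))\n",
"```\n",
"\n",
"Whether to drop rows or pass na.rm=TRUE depends on the analysis: dropping rows loses the valid values in the other columns of those rows."
]
},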
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 整合与重构\n",
"\n",
"### 转置\n",
"\n",
"常见的转置方法是 t() 函数:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\t1 | 2 |
\n",
"\t3 | 4 |
\n",
"\t5 | 6 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{ll}\n",
"\t 1 & 2\\\\\n",
"\t 3 & 4\\\\\n",
"\t 5 & 6\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| 1 | 2 | \n",
"| 3 | 4 | \n",
"| 5 | 6 | \n",
"\n",
"\n"
],
"text/plain": [
" [,1] [,2]\n",
"[1,] 1 2 \n",
"[2,] 3 4 \n",
"[3,] 5 6 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df = matrix(1:6, nrow=2, ncol=3)\n",
"t(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 整合:aggregate()\n",
"\n",
"这个函数是非常强大的。语法:\n",
"\n",
" aggregate(x, by=list(), FUN)\n",
" \n",
"其中 x 是待整合的数据对象,by 是分类依据的列,FUN 是待应用的标量函数。"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"b1 | b2 | v1 | v2 |
\n",
"\n",
"\t1 | 95 | 5.0 | 55 |
\n",
"\t2 | 95 | 7.0 | 77 |
\n",
"\t1 | 99 | 5.5 | 55 |
\n",
"\t2 | 99 | NA | NA |
\n",
"\tbig | damp | 3.0 | 33 |
\n",
"\tblue | dry | 3.0 | 33 |
\n",
"\tred | red | 4.0 | 44 |
\n",
"\tred | wet | 1.0 | 11 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" b1 & b2 & v1 & v2\\\\\n",
"\\hline\n",
"\t 1 & 95 & 5.0 & 55 \\\\\n",
"\t 2 & 95 & 7.0 & 77 \\\\\n",
"\t 1 & 99 & 5.5 & 55 \\\\\n",
"\t 2 & 99 & NA & NA \\\\\n",
"\t big & damp & 3.0 & 33 \\\\\n",
"\t blue & dry & 3.0 & 33 \\\\\n",
"\t red & red & 4.0 & 44 \\\\\n",
"\t red & wet & 1.0 & 11 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"b1 | b2 | v1 | v2 | \n",
"|---|---|---|---|---|---|---|---|\n",
"| 1 | 95 | 5.0 | 55 | \n",
"| 2 | 95 | 7.0 | 77 | \n",
"| 1 | 99 | 5.5 | 55 | \n",
"| 2 | 99 | NA | NA | \n",
"| big | damp | 3.0 | 33 | \n",
"| blue | dry | 3.0 | 33 | \n",
"| red | red | 4.0 | 44 | \n",
"| red | wet | 1.0 | 11 | \n",
"\n",
"\n"
],
"text/plain": [
" b1 b2 v1 v2\n",
"1 1 95 5.0 55\n",
"2 2 95 7.0 77\n",
"3 1 99 5.5 55\n",
"4 2 99 NA NA\n",
"5 big damp 3.0 33\n",
"6 blue dry 3.0 33\n",
"7 red red 4.0 44\n",
"8 red wet 1.0 11"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 这个例子改编自 R 的官方帮助 aggregate()\n",
"df <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,6,7,9),\n",
" v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )\n",
"by1 <- c(\"red\", \"blue\", 1, 2, NA, \"big\", 1, 2, \"red\", 1, NA, 12)\n",
"by2 <- c(\"wet\", \"dry\", 99, 95, NA, \"damp\", 95, 99, \"red\", 99, NA, NA)\n",
"\n",
"# 按照 by1 & by2 整合原数据 testDF\n",
"# 注意(by1, by2)=(1, 99) 对应 (v1, v2)=(5, 55) 与 (6,55) 两条数据\n",
"# 因此第三行的 v1 = mean(c(5, 6)) = 5.5\n",
"aggregate(x = df, by = list(b1=by1, b2=by2), FUN = \"mean\")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"by1 | by2 | V1 |
\n",
"\n",
"\t1 | 95 | 5.0 |
\n",
"\t2 | 95 | 7.0 |
\n",
"\t1 | 99 | 5.5 |
\n",
"\tbig | damp | 3.0 |
\n",
"\tblue | dry | 3.0 |
\n",
"\tred | red | 4.0 |
\n",
"\tred | wet | 1.0 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" by1 & by2 & V1\\\\\n",
"\\hline\n",
"\t 1 & 95 & 5.0 \\\\\n",
"\t 2 & 95 & 7.0 \\\\\n",
"\t 1 & 99 & 5.5 \\\\\n",
"\t big & damp & 3.0 \\\\\n",
"\t blue & dry & 3.0 \\\\\n",
"\t red & red & 4.0 \\\\\n",
"\t red & wet & 1.0 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"by1 | by2 | V1 | \n",
"|---|---|---|---|---|---|---|\n",
"| 1 | 95 | 5.0 | \n",
"| 2 | 95 | 7.0 | \n",
"| 1 | 99 | 5.5 | \n",
"| big | damp | 3.0 | \n",
"| blue | dry | 3.0 | \n",
"| red | red | 4.0 | \n",
"| red | wet | 1.0 | \n",
"\n",
"\n"
],
"text/plain": [
" by1 by2 V1 \n",
"1 1 95 5.0\n",
"2 2 95 7.0\n",
"3 1 99 5.5\n",
"4 big damp 3.0\n",
"5 blue dry 3.0\n",
"6 red red 4.0\n",
"7 red wet 1.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 用公式筛选原数据的列,仅整合这些列\n",
"# 注意:v1中的一个含 NA 的观测被移除\n",
"aggregate(cbind(df$v1) ~ by1+by2, FUN = \"mean\")"
]
},
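{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formula interface is often clearer when the grouping variables live in the same data frame as the values. A minimal sketch with a made-up data frame `sales` (all names here are illustrative):\n",
"\n",
"```r\n",
"sales <- data.frame(region=c(\"N\", \"S\", \"N\", \"S\", \"N\"),\n",
"                    amount=c(10, 20, 30, 40, 80))\n",
"\n",
"# Mean amount per region: left of ~ is what to summarize, right is the grouping\n",
"by_region <- aggregate(amount ~ region, data=sales, FUN=mean)\n",
"by_region\n",
"```\n",
"\n",
"Passing data= avoids the df$ prefixes and keeps the original column names in the result."
]
},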
{
"cell_type": "markdown",
"metadata": {},
"source": [
"还有一个强大的整合包 reshape2,这里就不多介绍了。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 函数式编程\n",
"\n",
"函数式编程是每个科学计算语言中的重要内容;操作实现的优先级依次是**矢量运算(例如 df+1)、函数式书写,最后才是循环语句**。在 R 中,函数式编程主要是由 apply 函数族承担。R 中的 apply 函数族包括:\n",
"\n",
"- apply():指定轴向。传入 data.frame,返回 vector.\n",
"- tapply():\n",
"- vapply():\n",
"- lapply():\n",
"- sapply():\n",
"- mapply():\n",
"- rapply():\n",
"- eapply():\n",
"\n",
"下面依次介绍。"
]
},
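{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going through them one by one, here is a compact sketch of three family members that are easy to mix up. The vectors `score` and `group` are invented for the illustration.\n",
"\n",
"```r\n",
"score <- c(80, 90, 60, 70)\n",
"group <- c(\"a\", \"b\", \"a\", \"b\")\n",
"\n",
"# tapply(): one summary per group\n",
"group_mean <- tapply(score, group, mean)\n",
"\n",
"# vapply(): like sapply(), but each result must be a single numeric value\n",
"sq_sum <- vapply(list(a=c(0, 1), b=c(1, 2)), function(x) sum(x^2), numeric(1))\n",
"\n",
"# mapply(): walks over several vectors in parallel\n",
"pair_sum <- mapply(function(x, y) x + y, 1:3, 4:6)\n",
"\n",
"print(list(group_mean, sq_sum, pair_sum))\n",
"```\n",
"\n",
"vapply() stops with an error if a result does not match the declared template, which makes it a safer choice than sapply() inside functions."
]
},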
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### apply():指定多维对象的轴\n",
"\n",
"在 R 中,通过 apply() 可以将函数运用于多维对象。基本语法是:\n",
"\n",
" apply(d, N, FUN, ...)\n",
"\n",
"其中,N 用于指定将函数 FUN 应用于数据 d 的第几维(1为行,2为列)。省略号中可以传入 function 的参数。"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"x | y | z | s |
\n",
"\n",
"\t1 | 5 | 8 | 3 |
\n",
"\t2 | 4 | 6 | 7 |
\n",
"\t3 | 2 | 9 | 4 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" x & y & z & s\\\\\n",
"\\hline\n",
"\t 1 & 5 & 8 & 3\\\\\n",
"\t 2 & 4 & 6 & 7\\\\\n",
"\t 3 & 2 & 9 & 4\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x | y | z | s | \n",
"|---|---|---|\n",
"| 1 | 5 | 8 | 3 | \n",
"| 2 | 4 | 6 | 7 | \n",
"| 3 | 2 | 9 | 4 | \n",
"\n",
"\n"
],
"text/plain": [
" x y z s\n",
"1 1 5 8 3\n",
"2 2 4 6 7\n",
"3 3 2 9 4"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df <- data.frame(x=c(1, 2, 3), y=c(5, 4, 2), z=c(8, 6, 9), s=c(3, 7, 4))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1]]\n",
"x y z s \n",
"2 4 8 4 \n",
"\n",
"[[2]]\n",
"[1] 2.50 3.50 2.75\n",
"\n"
]
}
],
"source": [
"# 计算 df 各列的中位数\n",
"colmean <- apply(df, 2, median)\n",
"# 计算 df 各行的 25 分位数\n",
"rowquan <- apply(df, 1, quantile, probs=.25)\n",
"\n",
"print(list(colmean, rowquan))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### lapply():列表式应用\n",
"\n",
"lapply 函数的本意是对 list 对象进行操作。返回值是 list 类型。"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\t- $a
\n",
"\t\t- 1
\n",
"\t- $b
\n",
"\t\t- 5
\n",
"\t- $c
\n",
"\t\t- 25
\n",
"
\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$a] 1\n",
"\\item[\\$b] 5\n",
"\\item[\\$c] 25\n",
"\\end{description}\n"
],
"text/markdown": [
"$a\n",
": 1\n",
"$b\n",
": 5\n",
"$c\n",
": 25\n",
"\n",
"\n"
],
"text/plain": [
"$a\n",
"[1] 1\n",
"\n",
"$b\n",
"[1] 5\n",
"\n",
"$c\n",
"[1] 25\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lst <- list(a=c(0,1), b=c(1,2), c=c(3,4))\n",
"lapply(lst, function(x) {sum(x^2)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It also works on the columns of a data frame, because a data frame is essentially a list of its columns:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<dl>\n",
"\t<dt>$x</dt>\n",
"\t\t<dd>6</dd>\n",
"\t<dt>$y</dt>\n",
"\t\t<dd>11</dd>\n",
"\t<dt>$z</dt>\n",
"\t\t<dd>23</dd>\n",
"\t<dt>$s</dt>\n",
"\t\t<dd>14</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$x] 6\n",
"\\item[\\$y] 11\n",
"\\item[\\$z] 23\n",
"\\item[\\$s] 14\n",
"\\end{description}\n"
],
"text/markdown": [
"$x\n",
": 6\n",
"$y\n",
": 11\n",
"$z\n",
": 23\n",
"$s\n",
": 14\n",
"\n",
"\n"
],
"text/plain": [
"$x\n",
"[1] 6\n",
"\n",
"$y\n",
"[1] 11\n",
"\n",
"$z\n",
"[1] 23\n",
"\n",
"$s\n",
"[1] 14\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"lapply(df, sum)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### sapply()/vapply(): variants of lapply()\n",
"\n",
"sapply() is essentially a variant of lapply() that simplifies the result to a vector (or matrix) where possible, instead of returning a list."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"'numeric'"
],
"text/latex": [
"'numeric'"
],
"text/markdown": [
"'numeric'"
],
"text/plain": [
"[1] \"numeric\""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"'list'"
],
"text/latex": [
"'list'"
],
"text/markdown": [
"'list'"
],
"text/plain": [
"[1] \"list\""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"class(sapply(lst, function(x) {sum(x^2)}))\n",
"class(lapply(lst, function(x) {sum(x^2)}))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" x y z s \n",
" 6 11 23 14 \n"
]
}
],
"source": [
"print(sapply(df, sum))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
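"The argument simplify=TRUE is the default and means that a vector is returned instead of a list whenever possible. With simplify=FALSE, sapply() falls back to the behavior of lapply()."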
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<dl>\n",
"\t<dt>$x</dt>\n",
"\t\t<dd>6</dd>\n",
"\t<dt>$y</dt>\n",
"\t\t<dd>11</dd>\n",
"\t<dt>$z</dt>\n",
"\t\t<dd>23</dd>\n",
"\t<dt>$s</dt>\n",
"\t\t<dd>14</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description}\n",
"\\item[\\$x] 6\n",
"\\item[\\$y] 11\n",
"\\item[\\$z] 23\n",
"\\item[\\$s] 14\n",
"\\end{description}\n"
],
"text/markdown": [
"$x\n",
": 6\n",
"$y\n",
": 11\n",
"$z\n",
": 23\n",
"$s\n",
": 14\n",
"\n",
"\n"
],
"text/plain": [
"$x\n",
"[1] 6\n",
"\n",
"$y\n",
"[1] 11\n",
"\n",
"$z\n",
"[1] 23\n",
"\n",
"$s\n",
"[1] 14\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sapply(df, sum, simplify=FALSE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
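"vapply() behaves like sapply() but requires a FUN.VALUE template that declares the type and length of each result, so the shape of the return value is predictable. If FUN.VALUE has names, they are used as the row names of the result; naming alone can often be handled instead by lapply()/sapply() followed by an external call to row.names()."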
]
},
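{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of vapply(), assuming the base R behavior that FUN.VALUE serves as a template for every result. The list lst is redefined here so the snippet is self-contained, and the names lo and hi are purely illustrative.\n",
"\n",
"```r\n",
"lst <- list(a = c(0, 1), b = c(1, 2), c = c(3, 4))\n",
"\n",
"# Each call must return exactly one numeric value, otherwise vapply() stops\n",
"sq_sums <- vapply(lst, function(x) sum(x^2), FUN.VALUE = numeric(1))\n",
"\n",
"# A named template of length 2: its names become the row names of the matrix\n",
"stats <- vapply(lst, function(x) c(min(x), max(x)),\n",
"                FUN.VALUE = c(lo = 0, hi = 0))\n",
"\n",
"print(sq_sums)\n",
"print(stats)\n",
"```\n",
"\n",
"Because a mismatch between the template and an actual result raises an error, vapply() tends to be a safer choice than sapply() inside functions."
]
},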
{
"cell_type": "markdown",
"metadata": {},
"source": [
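"### mapply(): application over multiple inputs\n",
"\n",
"mapply() accepts several input vectors and calls FUN on their elements in parallel:\n",
"\n",
"    mapply(FUN, [input1, input2, ...], MoreArgs=NULL)\n",
"    \n",
"The **lengths of the inputs should be equal or integer multiples of one another**, since shorter inputs are recycled. The benefit of this function is that the data do not have to be combined beforehand."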
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" [1] -2.0 -1.0 0.0 1.0 2.0 0.0 0.5 1.0 1.5 2.0\n"
]
}
],
"source": [
"print(mapply(min, seq(0, 2, by=0.5), -2:7))"
]
},
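{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following sketch illustrates the pairwise calls and the role of MoreArgs, which holds arguments shared by every call. The variable names reps and labels are illustrative.\n",
"\n",
"```r\n",
"# Pairwise calls: rep(1, 3), rep(2, 2), rep(3, 1)\n",
"# The results differ in length, so a list is returned\n",
"reps <- mapply(rep, 1:3, 3:1)\n",
"\n",
"# MoreArgs supplies sep = '-' to every call:\n",
"# paste('a', 'x', sep = '-') and paste('b', 'y', sep = '-')\n",
"labels <- mapply(paste, c('a', 'b'), c('x', 'y'), MoreArgs = list(sep = '-'))\n",
"\n",
"print(reps)\n",
"print(unname(labels))\n",
"```\n",
"\n",
"unname() is used when printing because mapply() names the result after the first input when that input is a character vector."
]
},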
{
"cell_type": "markdown",
"metadata": {},
"source": [
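"### tapply(): grouped application\n",
"\n",
"tapply() splits the data into groups by the levels of a factor and then applies the function to each group, similar to a group by operation:\n",
"\n",
"    tapply(X, idx, FUN)\n",
"\n",
"Here X is the data and idx is the grouping variable."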
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"$a\n",
"[1] 1 4 9\n",
"\n",
"$b\n",
"[1] 2 6 12\n",
"\n"
]
}
],
"source": [
"df <- data.frame(x=1:6, groups=rep(c(\"a\", \"b\"), 3))\n",
"print(tapply(df$x, df$groups, cumsum))"
]
},
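{
"cell_type": "markdown",
"metadata": {},
"source": [
"When FUN returns a single value per group, tapply() gives one named entry per factor level, which is the usual group by aggregate. A small sketch with made-up data (the data frame scores and its columns are illustrative):\n",
"\n",
"```r\n",
"scores <- data.frame(value = c(10, 20, 30, 40, 50, 60),\n",
"                     group = rep(c('a', 'b'), 3))\n",
"\n",
"# Mean of value within each level of group\n",
"group_means <- tapply(scores$value, scores$group, mean)\n",
"\n",
"print(group_means)\n",
"```\n",
"\n",
"Group a contains 10, 30 and 50, and group b contains 20, 40 and 60, so the result has one mean for each of the two groups."
]
},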
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other functions in the apply() family are rarely used and are not covered here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
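"## Other useful functions\n",
"\n",
"The [\"Reading and writing data\" article](ReadData.ipynb) in this series also introduces some useful functions and is worth consulting.\n",
"\n",
"In addition:\n",
"\n",
"| Function | Description |\n",
"| --- | --- |\n",
"| seq(from=N, to=N, by=N, [length.out=N, along.with=obj]) | Generates a sequence. The arguments are the start, the end, the step, the length of the output, and an object whose length the sequence should match. |\n",
"| rep(x, N) | Repeats x. For example, rep(1:2, 2) produces the vector c(1, 2, 1, 2). |\n",
"| cut(x, N, [ordered_result=F]) | Cuts into a factor. Divides the continuous variable x into a factor with N levels; ordered_result controls whether the factor is ordered. |\n",
"| pretty(x, N) | Pretty breakpoints. Divides the range of x into about N intervals (N+1 endpoints) whose endpoints are round values. Used in plotting. |\n",
"| cat(obj1, obj2, ..., [file=, append=]) | Concatenates several objects and writes them to the screen or to a file. |"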
]
}
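,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of the functions in the table, using made-up inputs; all variable names are illustrative.\n",
"\n",
"```r\n",
"s1 <- seq(from = 0, to = 1, by = 0.25)   # fixed step\n",
"s2 <- seq(1, 10, length.out = 4)         # fixed number of points\n",
"r  <- rep(1:2, 2)                        # the whole vector repeated twice\n",
"f  <- cut(c(1, 5, 9), 3)                 # factor with 3 interval levels\n",
"p  <- pretty(c(0.3, 9.7), 5)             # round breakpoints covering the range\n",
"\n",
"print(s1)\n",
"print(s2)\n",
"print(r)\n",
"print(levels(f))\n",
"print(p)\n",
"cat('length of s1:', length(s1), '; levels of f:', nlevels(f))\n",
"```\n",
"\n",
"The number of breakpoints returned by pretty() is only approximately the requested one, so it is worth checking the result before using it as axis ticks."
]
}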
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.0.0"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}