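{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Management\n", "\n", "This section applies once the data has been read in. It covers basic operations (including statistical functions), data reorganization (sorting, merging, subsetting, random sampling, aggregation, reshaping), string handling, and the treatment of special values (NA/Inf/NaN). It also covers functional-programming helpers such as apply()." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mathematical functions\n", "\n", "Arithmetic operators and some functions needed for statistics.\n", "\n", "### Arithmetic operators\n", "\n", "| Basic arithmetic | Exponentiation | Remainder | Integer division |\n", "| --- | --- | --- | --- |\n", "| +, -, \\*, / | ^ or \\*\\* | %% | %/% |\n", "\n", "Example:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 8 1 2\n" ] } ], "source": [ "a <- 2 ^ 3\n", "b <- 5 %% 2\n", "c <- 5 %/% 2\n", "print(c(a, b, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic mathematical functions\n", "\n", "- Absolute value: abs()\n", "- Square root: sqrt()\n", "- Trigonometric functions: sin(), cos(), tan(), acos(), asin(), atan()\n", "- Logarithms:\n", " - log(x, base=n): logarithm of x to base n (the natural logarithm if base is omitted)\n", " - log10(x): logarithm to base 10\n", "- Exponential: exp()\n", "- Rounding:\n", " - Round up: ceiling()\n", " - Round down: floor()\n", " - Truncate toward zero: trunc()\n", " - Round to N decimal places: round(x, digits=N)\n", " - Round to N significant digits: signif(x, digits=N)\n", "\n", "### Statistics, probability, and random numbers\n", "\n", "For descriptive statistics and further statistical material, see [the “Descriptive Statistics” article](DescriptiveStatistics.ipynb).\n", "\n", "#### Statistical functions\n", "\n", "Commonly used statistical functions:\n", "\n", "- Mean: mean()\n", "- Median: median()\n", "- Standard deviation: sd()\n", "- Variance: var()\n", "- Median absolute deviation: mad(x, center=median(x), constant=1.4826, ...), computed as:\n", "\n", "$$ \\mathrm{mad}(x) = \\mathrm{constant} \\times \\mathrm{median}(|x - \\mathrm{center}|)$$\n", "\n", "- Quantiles: quantile(x, probs); for example, quantile(x, c(.3, .84)) returns the 30% and 84% quantiles of x.\n", "- Extremes: min() and max()\n", "- Range: range(x) returns the minimum and maximum; for example, range(c(1, 2, 3)) gives c(1, 3). The spread (maximum minus minimum) is diff(range(x)).\n", "- Differences: diff(x, lag=1). The lag argument sets how many positions apart the differenced elements are; the default is 1.\n", "- Standardization: scale(x, center=TRUE, scale=TRUE). Use scale(x) * SD + C to obtain a rescaled result with standard deviation SD and mean C." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Probability functions\n", "\n", "Common probability distributions and their abbreviations:\n", "\n", "- Normal: norm\n", "- Poisson: pois\n", "- Uniform: unif\n", "- Beta: beta\n", "- Binomial: binom\n", "- Cauchy: cauchy\n", "- 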
Chi-squared: chisq\n", "- Exponential: exp\n", "- F: f\n", "- Student's t: t\n", "- Gamma: gamma\n", "- Geometric: geom\n", "- Hypergeometric: hyper\n", "- Log-normal: lnorm\n", "- Logistic: logis\n", "- Multinomial: multinom (only dmultinom() and rmultinom() exist)\n", "- Negative binomial: nbinom\n", "\n", "Writing the abbreviation of a distribution as *abbr*, the corresponding functions are:\n", "\n", "1. **Density function**: d{abbr}(), e.g. dnorm() for the normal distribution\n", "2. **Distribution function**: p{abbr}()\n", "3. **Quantile function**: q{abbr}()\n", "4. **Random number generation**: r{abbr}(), e.g. the commonly used runif() for uniform draws\n", "\n", "#### Examples\n", "\n", "runif() produces pseudo-random numbers uniformly distributed on $[0, 1]$. set.seed() fixes the random seed so that the code is reproducible. Note that every random draw advances the generator state, so **to reproduce one particular draw, call set.seed() again immediately before it.**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 0.2875775 0.7883051 0.4089769\n" ] } ], "source": [ "set.seed(123)\n", "print(runif(3))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "0.97500210485178" ], "text/latex": [ "0.97500210485178" ], "text/markdown": [ "0.97500210485178" ], "text/plain": [ "[1] 0.9750021" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Area under the standard normal curve to the left of 1.96\n", "pnorm(1.96)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "628.15515655446" ], "text/latex": [ "628.15515655446" ], "text/markdown": [ "628.15515655446" ], "text/plain": [ "[1] 628.1552" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# 0.9 quantile of a normal distribution with mean 500 and standard deviation 100\n", "qnorm(.9, mean=500, sd=100)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 44.39524 47.69823 65.58708\n" ] } ], "source": [ "# Generate 3 normal random numbers with mean 50 and standard deviation 10\n", "set.seed(123)\n", "print(rnorm(3, mean=50, sd=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with data frames\n", "\n", "The data frame is the most frequently used data type. Below are some practical scenarios that come up when working with data frames, along with solutions.\n", "\n", "### Row and column operations\n", "\n", "#### Creating\n", "\n", "Creating a new column (variable) is a very common operation. Suppose we have a data frame df and want to add a column on the right equal to the sum of the two columns on its left." ] }, { "cell_type": 
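"markdown", "metadata": {}, "source": [ "As a rough sketch of a third option alongside the `$` and `transform()` cells in this subsection, base R's `within()` can also add a computed column. The names `demo` and `sumx3` are hypothetical and exist only for this illustration.\n", "\n", "```r\n", "# Hypothetical demo data frame, kept separate from the notebook's df\n", "demo <- data.frame(x1 = c(1, 3, 5), x2 = c(2, 4, 6))\n", "\n", "# within() evaluates the expression inside the data frame and\n", "# returns a modified copy; demo changes only because we reassign it\n", "demo <- within(demo, sumx3 <- x1 + x2)\n", "print(demo$sumx3)\n", "```\n", "\n", "Like `transform()`, this avoids repeating the data frame name for every column reference." ] }, { "cell_type": 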
"code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2sumx
1 2 3
3 4 7
5 6 11
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " x1 & x2 & sumx\\\\\n", "\\hline\n", "\t 1 & 2 & 3\\\\\n", "\t 3 & 4 & 7\\\\\n", "\t 5 & 6 & 11\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | sumx | \n", "|---|---|---|\n", "| 1 | 2 | 3 | \n", "| 3 | 4 | 7 | \n", "| 5 | 6 | 11 | \n", "\n", "\n" ], "text/plain": [ " x1 x2 sumx\n", "1 1 2 3 \n", "2 3 4 7 \n", "3 5 6 11 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df <- data.frame(x1=c(1, 3, 5), x2=c(2, 4, 6))\n", "# Declare a new column directly with the dollar sign\n", "df$sumx <- df$x1 + df$x2\n", "df" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2sumxsumx2
1 2 3 3
3 4 7 7
5 6 1111
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " x1 & x2 & sumx & sumx2\\\\\n", "\\hline\n", "\t 1 & 2 & 3 & 3\\\\\n", "\t 3 & 4 & 7 & 7\\\\\n", "\t 5 & 6 & 11 & 11\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | sumx | sumx2 | \n", "|---|---|---|\n", "| 1 | 2 | 3 | 3 | \n", "| 3 | 4 | 7 | 7 | \n", "| 5 | 6 | 11 | 11 | \n", "\n", "\n" ], "text/plain": [ " x1 x2 sumx sumx2\n", "1 1 2 3 3 \n", "2 3 4 7 7 \n", "3 5 6 11 11 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Or use the transform() function\n", "df <- transform(df, sumx2=x1+x2)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Renaming" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"x1\" \"x2\" \"sumx\" \"SUM\" \n" ] } ], "source": [ "colnames(df)[4] <- \"SUM\"\n", "print(colnames(df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Selecting/dropping: subset()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2
12
34
56
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " x1 & x2\\\\\n", "\\hline\n", "\t 1 & 2\\\\\n", "\t 3 & 4\\\\\n", "\t 5 & 6\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | \n", "|---|---|---|\n", "| 1 | 2 | \n", "| 3 | 4 | \n", "| 5 | 6 | \n", "\n", "\n" ], "text/plain": [ " x1 x2\n", "1 1 2 \n", "2 3 4 \n", "3 5 6 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Select the first two columns\n", "df[,1:2] # or df[c(\"x1\", \"x2\")]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2SUM
1 2 3
3 4 7
5 6 11
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " x1 & x2 & SUM\\\\\n", "\\hline\n", "\t 1 & 2 & 3\\\\\n", "\t 3 & 4 & 7\\\\\n", "\t 5 & 6 & 11\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | SUM | \n", "|---|---|---|\n", "| 1 | 2 | 3 | \n", "| 3 | 4 | 7 | \n", "| 5 | 6 | 11 | \n", "\n", "\n" ], "text/plain": [ " x1 x2 SUM\n", "1 1 2 3 \n", "2 3 4 7 \n", "3 5 6 11 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Drop the column named sumx\n", "df <- df[names(df) != \"sumx\"]\n", "df" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2
12
34
56
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " x1 & x2\\\\\n", "\\hline\n", "\t 1 & 2\\\\\n", "\t 3 & 4\\\\\n", "\t 5 & 6\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | \n", "|---|---|---|\n", "| 1 | 2 | \n", "| 3 | 4 | \n", "| 5 | 6 | \n", "\n", "\n" ], "text/plain": [ " x1 x2\n", "1 1 2 \n", "2 3 4 \n", "3 5 6 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Drop the third column\n", "df <- df[-c(3)] # or df[c(-3)]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting rows works much like selecting columns:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\n", "
x1x2
234
356
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " & x1 & x2\\\\\n", "\\hline\n", "\t2 & 3 & 4\\\\\n", "\t3 & 5 & 6\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | x1 | x2 | \n", "|---|---|\n", "| 2 | 3 | 4 | \n", "| 3 | 5 | 6 | \n", "\n", "\n" ], "text/plain": [ " x1 x2\n", "2 3 4 \n", "3 5 6 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Select the observations (rows) where x1 > 2 and x2 is even\n", "df[df$x1 > 2 & df$x2 %% 2 == 0,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next comes the subset() function, which is simple and direct. First, a slightly more complex dataset:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
agegenderq1q2q3
22 Male 1 4 3
37 Female5 4 2
28 Male 3 5 4
33 Female3 3 3
43 Male 2 1 1
\n" ], "text/latex": [ "\\begin{tabular}{r|lllll}\n", " age & gender & q1 & q2 & q3\\\\\n", "\\hline\n", "\t 22 & Male & 1 & 4 & 3 \\\\\n", "\t 37 & Female & 5 & 4 & 2 \\\\\n", "\t 28 & Male & 3 & 5 & 4 \\\\\n", "\t 33 & Female & 3 & 3 & 3 \\\\\n", "\t 43 & Male & 2 & 1 & 1 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "age | gender | q1 | q2 | q3 | \n", "|---|---|---|---|---|\n", "| 22 | Male | 1 | 4 | 3 | \n", "| 37 | Female | 5 | 4 | 2 | \n", "| 28 | Male | 3 | 5 | 4 | \n", "| 33 | Female | 3 | 3 | 3 | \n", "| 43 | Male | 2 | 1 | 1 | \n", "\n", "\n" ], "text/plain": [ " age gender q1 q2 q3\n", "1 22 Male 1 4 3 \n", "2 37 Female 5 4 2 \n", "3 28 Male 3 5 4 \n", "4 33 Female 3 3 3 \n", "5 43 Male 2 1 1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "DF <- data.frame(age = c(22, 37, 28, 33, 43),\n", " gender = c(1, 2, 1, 2, 1),\n", " q1 = c(1, 5, 3, 3, 2),\n", " q2 = c(4, 4, 5, 3, 1),\n", " q3 = c(3, 2, 4, 3, 1))\n", "DF$gender <- factor(DF$gender, labels=c(\"Male\", \"Female\"))\n", "\n", "DF" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
agegenderq1q2
237 Female5 4
328 Male 3 5
433 Female3 3
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " & age & gender & q1 & q2\\\\\n", "\\hline\n", "\t2 & 37 & Female & 5 & 4 \\\\\n", "\t3 & 28 & Male & 3 & 5 \\\\\n", "\t4 & 33 & Female & 3 & 3 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | age | gender | q1 | q2 | \n", "|---|---|---|\n", "| 2 | 37 | Female | 5 | 4 | \n", "| 3 | 28 | Male | 3 | 5 | \n", "| 4 | 33 | Female | 3 | 3 | \n", "\n", "\n" ], "text/plain": [ " age gender q1 q2\n", "2 37 Female 5 4 \n", "3 28 Male 3 5 \n", "4 33 Female 3 3 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Select the observations with age strictly between 25 and 40,\n", "# keeping only the variables age through q2\n", "subset(DF, age > 25 & age < 40, select=age:q2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Horizontal merging\n", "\n", "Two data frames that share one or more key variables can be joined with merge(), which performs an inner join by default. The row counts do **not** need to match: rows without a partner in the other data frame are dropped." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
IDSymOprtr.xOprtr.y
1Axx
2Byz
3Czy
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " ID & Sym & Oprtr.x & Oprtr.y\\\\\n", "\\hline\n", "\t 1 & A & x & x\\\\\n", "\t 2 & B & y & z\\\\\n", "\t 3 & C & z & y\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "ID | Sym | Oprtr.x | Oprtr.y | \n", "|---|---|---|\n", "| 1 | A | x | x | \n", "| 2 | B | y | z | \n", "| 3 | C | z | y | \n", "\n", "\n" ], "text/plain": [ " ID Sym Oprtr.x Oprtr.y\n", "1 1 A x x \n", "2 2 B y z \n", "3 3 C z y " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1 <- data.frame(ID=c(1, 2, 3), Sym=c(\"A\", \"B\", \"C\"), Oprtr=c(\"x\", \"y\", \"z\"))\n", "df2 <- data.frame(ID=c(1, 3, 2), Oprtr=c(\"x\", \"y\", \"z\"))\n", "\n", "# Merge on the ID column\n", "merge(df1, df2, by=\"ID\")" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\n", "
IDOprtrSym
1xA
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " ID & Oprtr & Sym\\\\\n", "\\hline\n", "\t 1 & x & A\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "ID | Oprtr | Sym | \n", "|---|\n", "| 1 | x | A | \n", "\n", "\n" ], "text/plain": [ " ID Oprtr Sym\n", "1 1 x A " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Only one row agrees on both ID and Oprtr, so all other rows are dropped\n", "merge(df1, df2, by=c(\"ID\", \"Oprtr\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, combine the data frames directly with cbind(), which requires the same number of rows and does no key matching." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
IDSymOprtrIDOprtr
1Ax1x
2By3y
3Cz2z
\n" ], "text/latex": [ "\\begin{tabular}{r|lllll}\n", " ID & Sym & Oprtr & ID & Oprtr\\\\\n", "\\hline\n", "\t 1 & A & x & 1 & x\\\\\n", "\t 2 & B & y & 3 & y\\\\\n", "\t 3 & C & z & 2 & z\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "ID | Sym | Oprtr | ID | Oprtr | \n", "|---|---|---|\n", "| 1 | A | x | 1 | x | \n", "| 2 | B | y | 3 | y | \n", "| 3 | C | z | 2 | z | \n", "\n", "\n" ], "text/plain": [ " ID Sym Oprtr ID Oprtr\n", "1 1 A x 1 x \n", "2 2 B y 3 y \n", "3 3 C z 2 z " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Combine directly. Note: with duplicate column names, access by name returns only the leftmost match\n", "cbind(df1, df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Vertical merging\n", "\n", "This amounts to appending observations. The two data frames must have **the same variables**, although the order may differ. If the variables differ, either:\n", "\n", "- delete the extra variables; or\n", "- add the missing variables, under the same names, to the data frame that lacks them and set them to the missing value NA." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
IDSymOprtr
1 A x
2 B y
3 C z
1 NAx
3 NAy
2 NAz
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " ID & Sym & Oprtr\\\\\n", "\\hline\n", "\t 1 & A & x \\\\\n", "\t 2 & B & y \\\\\n", "\t 3 & C & z \\\\\n", "\t 1 & NA & x \\\\\n", "\t 3 & NA & y \\\\\n", "\t 2 & NA & z \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "ID | Sym | Oprtr | \n", "|---|---|---|---|---|---|\n", "| 1 | A | x | \n", "| 2 | B | y | \n", "| 3 | C | z | \n", "| 1 | NA | x | \n", "| 3 | NA | y | \n", "| 2 | NA | z | \n", "\n", "\n" ], "text/plain": [ " ID Sym Oprtr\n", "1 1 A x \n", "2 2 B y \n", "3 3 C z \n", "4 1 NA x \n", "5 3 NA y \n", "6 2 NA z " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df1 <- data.frame(ID=c(1, 2, 3), Sym=c(\"A\", \"B\", \"C\"), Oprtr=c(\"x\", \"y\", \"z\"))\n", "df2 <- data.frame(ID=c(1, 3, 2), Oprtr=c(\"x\", \"y\", \"z\"))\n", "df2$Sym <- NA\n", "\n", "rbind(df1, df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logical filtering\n", "\n", "Logical conditions can be used to filter data, to select a subset, or to modify a whole subset at once. Some of the earlier examples already did this." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2x3
1 2 NA
3 4 8
5 6 NA
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " x1 & x2 & x3\\\\\n", "\\hline\n", "\t 1 & 2 & NA\\\\\n", "\t 3 & 4 & 8\\\\\n", "\t 5 & 6 & NA\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | x3 | \n", "|---|---|---|\n", "| 1 | 2 | NA | \n", "| 3 | 4 | 8 | \n", "| 5 | 6 | NA | \n", "\n", "\n" ], "text/plain": [ " x1 x2 x3\n", "1 1 2 NA\n", "2 3 4 8\n", "3 5 6 NA" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df$x3 <- c(7, 8, 9)\n", "# Replace the odd values in column x3 with NA\n", "df$x3[df$x3 %% 2 == 1] <- NA\n", "df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
x1x2x3y
NaN NaN NA 7
3 4 8 -Inf
5 6 NA Inf
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " x1 & x2 & x3 & y\\\\\n", "\\hline\n", "\t NaN & NaN & NA & 7\\\\\n", "\t 3 & 4 & 8 & -Inf\\\\\n", "\t 5 & 6 & NA & Inf\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x1 | x2 | x3 | y | \n", "|---|---|---|\n", "| NaN | NaN | NA | 7 | \n", "| 3 | 4 | 8 | -Inf | \n", "| 5 | 6 | NA | Inf | \n", "\n", "\n" ], "text/plain": [ " x1 x2 x3 y \n", "1 NaN NaN NA 7\n", "2 3 4 8 -Inf\n", "3 5 6 NA Inf" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df$y <- c(7, 12, 27)\n", "# Mark every value below 3 as NaN\n", "# Mark every value above 10 as Inf if odd, -Inf if even\n", "\n", "df[df < 3] <- NaN\n", "df[df > 10 & df %% 2 == 1] <- Inf\n", "df[df > 10 & df %% 2 == 0] <- -Inf\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sorting\n", "\n", "Sorting is done with the order() function." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
agegender
543 Male
328 Male
122 Male
237 Female
433 Female
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " & age & gender\\\\\n", "\\hline\n", "\t5 & 43 & Male \\\\\n", "\t3 & 28 & Male \\\\\n", "\t1 & 22 & Male \\\\\n", "\t2 & 37 & Female\\\\\n", "\t4 & 33 & Female\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | age | gender | \n", "|---|---|---|---|---|\n", "| 5 | 43 | Male | \n", "| 3 | 28 | Male | \n", "| 1 | 22 | Male | \n", "| 2 | 37 | Female | \n", "| 4 | 33 | Female | \n", "\n", "\n" ], "text/plain": [ " age gender\n", "5 43 Male \n", "3 28 Male \n", "1 22 Male \n", "2 37 Female\n", "4 33 Female" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df <- data.frame(age =c(22, 37, 28, 33, 43),\n", " gender=c(1, 2, 1, 2, 1))\n", "df$gender <- factor(df$gender, labels=c(\"Male\", \"Female\"))\n", "\n", "# Sort by gender ascending, then by age descending within each gender\n", "df[order(df$gender, -df$age),]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random sampling\n", "\n", "Drawing a random sample from an existing dataset is common practice. For example, one part can be used to build a predictive model and the other to validate it.\n", "\n", "```r\n", "# Draw a sample of size 3 from the rows of df, without replacement\n", "# (the trailing comma selects rows; without it, columns would be selected)\n", "df[sample(1:nrow(df), 3, replace=FALSE), ]\n", "```\n", "\n", "R packages for random sampling include sampling and survey; if possible I will cover them in a separate article in this series.\n", "\n", "### SQL statements\n", "\n", "In R, the sqldf package lets you operate on data frames (data.frame) directly with SQL statements. An example from the book:\n", "\n", "```r\n", "library(sqldf)\n", "newdf <- sqldf(\"select * from mtcars where carb=1 order by mpg\", row.names=TRUE)\n", "```\n", "\n", "This topic is not covered further here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String handling\n", "\n", "R provides the following kinds of string-handling functions:\n", "\n", "### General functions\n", "\n", "| Function | Meaning |\n", "| --- | --- |\n", "| nchar(x) | Length of each string |\n", "| substr(x, start, stop) | Extract a substring |\n", "| grep(pattern, x, ignore.case=FALSE, fixed=FALSE) | Regular-expression search; returns the indices of the matching elements. With fixed=TRUE the pattern is treated as a literal string rather than a regular expression. |\n", "| grepl() | Like grep(), but returns a logical vector. |\n", "| sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) | Search x for the pattern and replace the first match in each element with replacement (gsub() replaces all matches). With fixed=TRUE the pattern is treated as a literal string rather than a regular expression. |\n", "| strsplit(x, split, fixed=FALSE) | Split the elements of the character vector x at split; returns a list. |\n", "| paste(x1, x2, ..., sep=\" \") | 
Concatenate strings, separated by sep. It is vectorized, so it can also build numbered strings: `paste(\"x\", 1:3, sep=\"\")` |\n", "| toupper(x) | Convert a string to upper case |\n", "| tolower(x) | Convert a string to lower case |\n", "\n", "Some examples. First, using regular expressions:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1]]\n", "[1] 1 2 3 4\n", "\n", "[[2]]\n", "[1] 4\n", "\n", "[[3]]\n", "[1] \"Hey\" \"Hey\" \"Hey\" \"Hey5\"\n", "\n", "[[4]]\n", "[1] \"abc\" \"abcc\" \"abccc\" \"NEW\" \n", "\n" ] } ], "source": [ "streg <- c(\"abc\", \"abcc\", \"abccc\", \"abc5\")\n", "re1 <- grep(\"abc*\", streg)\n", "re2 <- grep(\"abc\\\\d\", streg) # Note: the backslash must be doubled to escape it in an R string\n", "re3 <- sub(\"[a-z]*\", \"Hey\", streg)\n", "re4 <- sub(\"[a-z]*\\\\d\", \"NEW\", streg)\n", "\n", "print(list(re1, re2, re3, re4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, splitting and concatenating strings. Note the handy vectorized use of paste() here:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1]]\n", "[[1]][[1]]\n", "[1] \"ab\"\n", "\n", "[[1]][[2]]\n", "[1] \"ab\" \"\" \n", "\n", "[[1]][[3]]\n", "[1] \"ab\" \"\" \"\" \n", "\n", "[[1]][[4]]\n", "[1] \"ab\" \"5\" \n", "\n", "\n", "[[2]]\n", "[1] \"a-b-c\"\n", "\n", "[[3]]\n", "[1] \"x1\" \"x2\" \"x3\"\n", "\n" ] } ], "source": [ "splt <- strsplit(streg, \"c\") # The separator \"c\" does not appear in the result\n", "cat1 <- paste(\"a\", \"b\", \"c\", sep=\"-\")\n", "cat2 <- paste(\"x\", 1:3, sep=\"\") # Very useful for generating column names\n", "\n", "print(list(splt, cat1, cat2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Date strings\n", "\n", "As with other types, date strings can be converted with the as.Date() function. The format codes mean the following:\n", "\n", "| Code | Meaning | Example (English locale) | Example (Chinese locale) |\n", "| --- | --- | --- | --- |\n", "| %d | Day of the month (1~31) | 22 | 22 |\n", "| %a | Abbreviated weekday | Mon | 周一 |\n", "| %A | Full weekday | Monday | 星期一 |\n", "| %m | Month (1~12) | 10 | 10 |\n", "| %b | Abbreviated month | Jan | 1月 |\n", "| %B | Full month | January | 一月 |\n", "| %y | Two-digit year | 17 | 17 |\n", "| %Y | Four-digit year | 2017 | 2017 |" ] }, { "cell_type": "code", "execution_count": 24, "metadata": 
{}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"2017-01-28\"\n" ] } ], "source": [ "# For string data x, usage: as.Date(x, format=, ...)\n", "dates <- as.Date(\"01-28-2017\", format=\"%m-%d-%Y\")\n", "print(dates)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to get the current date or time, and format() can help with the output." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "'星期六'" ], "text/latex": [ "'星期六'" ], "text/markdown": [ "'星期六'" ], "text/plain": [ "[1] \"星期六\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Sys.Date() returns a standard Date object, precise to the day\n", "dates1 <- Sys.Date()\n", "format(dates1, format=\"%A\") # The output format can be specified" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "'Sat Apr 22 15:30:54 2017'" ], "text/latex": [ "'Sat Apr 22 15:30:54 2017'" ], "text/markdown": [ "'Sat Apr 22 15:30:54 2017'" ], "text/plain": [ "[1] \"Sat Apr 22 15:30:54 2017\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# date() returns a detailed character string, precise to the second\n", "dates2 <- date()\n", "dates2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function difftime() computes time differences. The unit can be one of: \"auto\", \"secs\", \"mins\", \"hours\", \"days\", \"weeks\".\n", "\n", "As of the last update of this article, I am 1100+ weeks old. Hmm... put that way, it does not sound like much." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Time difference of 1169.429 weeks" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dates1 <- as.Date(\"1994-11-23\")\n", "dates2 <- Sys.Date()\n", "difftime(dates2, dates1, units=\"weeks\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Handling special values\n", "\n", "There are three kinds of special values:\n", "\n", "- NA: a missing value.\n", "- Inf: positive infinity; -Inf denotes negative infinity. **Infinities can be compared with ordinary numbers**; for example, -Inf < 3 is TRUE.\n", "- NaN: not a number, for example 0/0.\n", "\n", "is.na() tests each element for NA or NaN; applied to a data frame it returns a logical matrix. Note that NaN is also reported as missing." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
agegender
FALSEFALSE
FALSEFALSE
FALSEFALSE
FALSEFALSE
FALSEFALSE
\n" ], "text/latex": [ "\\begin{tabular}{ll}\n", " age & gender\\\\\n", "\\hline\n", "\t FALSE & FALSE\\\\\n", "\t FALSE & FALSE\\\\\n", "\t FALSE & FALSE\\\\\n", "\t FALSE & FALSE\\\\\n", "\t FALSE & FALSE\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "age | gender | \n", "|---|---|---|---|---|\n", "| FALSE | FALSE | \n", "| FALSE | FALSE | \n", "| FALSE | FALSE | \n", "| FALSE | FALSE | \n", "| FALSE | FALSE | \n", "\n", "\n" ], "text/plain": [ " age gender\n", "[1,] FALSE FALSE \n", "[2,] FALSE FALSE \n", "[3,] FALSE FALSE \n", "[4,] FALSE FALSE \n", "[5,] FALSE FALSE " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "is.na(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar functions test for Inf and NaN, but they are meant for atomic vectors (including matrices) rather than whole data frames. Note also that is.nan(NA) is FALSE:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] TRUE TRUE FALSE\n" ] } ], "source": [ "print(c(is.infinite(c(Inf, -Inf)), is.nan(NA)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dealing with NA values is a necessary step before any data processing. If some values are extreme outliers, you may also want to mark them as NA. Removing the affected rows is the simplest and bluntest approach." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
agegender
22 Male
37 Female
28 Male
33 Female
43 Male
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " age & gender\\\\\n", "\\hline\n", "\t 22 & Male \\\\\n", "\t 37 & Female\\\\\n", "\t 28 & Male \\\\\n", "\t 33 & Female\\\\\n", "\t 43 & Male \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "age | gender | \n", "|---|---|---|---|---|\n", "| 22 | Male | \n", "| 37 | Female | \n", "| 28 | Male | \n", "| 33 | Female | \n", "| 43 | Male | \n", "\n", "\n" ], "text/plain": [ " age gender\n", "1 22 Male \n", "2 37 Female\n", "3 28 Male \n", "4 33 Female\n", "5 43 Male " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Remove every row that contains an NA\n", "df <- na.omit(df)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregation and restructuring\n", "\n", "### Transposing\n", "\n", "The usual way to transpose is the t() function:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
12
34
56
\n" ], "text/latex": [ "\\begin{tabular}{ll}\n", "\t 1 & 2\\\\\n", "\t 3 & 4\\\\\n", "\t 5 & 6\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| 1 | 2 | \n", "| 3 | 4 | \n", "| 5 | 6 | \n", "\n", "\n" ], "text/plain": [ " [,1] [,2]\n", "[1,] 1 2 \n", "[2,] 3 4 \n", "[3,] 5 6 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df <- matrix(1:6, nrow=2, ncol=3)\n", "t(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Aggregating: aggregate()\n", "\n", "This is a very powerful function. Syntax:\n", "\n", " aggregate(x, by=list(), FUN)\n", " \n", "Here x is the data object to aggregate, by is a list of grouping variables, and FUN is the summary function applied to each group (typically one that returns a single value)." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
b1b2v1v2
1 95 5.0 55
2 95 7.0 77
1 99 5.5 55
2 99 NA NA
big damp3.0 33
bluedry 3.0 33
red red 4.0 44
red wet 1.0 11
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " b1 & b2 & v1 & v2\\\\\n", "\\hline\n", "\t 1 & 95 & 5.0 & 55 \\\\\n", "\t 2 & 95 & 7.0 & 77 \\\\\n", "\t 1 & 99 & 5.5 & 55 \\\\\n", "\t 2 & 99 & NA & NA \\\\\n", "\t big & damp & 3.0 & 33 \\\\\n", "\t blue & dry & 3.0 & 33 \\\\\n", "\t red & red & 4.0 & 44 \\\\\n", "\t red & wet & 1.0 & 11 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "b1 | b2 | v1 | v2 | \n", "|---|---|---|---|---|---|---|---|\n", "| 1 | 95 | 5.0 | 55 | \n", "| 2 | 95 | 7.0 | 77 | \n", "| 1 | 99 | 5.5 | 55 | \n", "| 2 | 99 | NA | NA | \n", "| big | damp | 3.0 | 33 | \n", "| blue | dry | 3.0 | 33 | \n", "| red | red | 4.0 | 44 | \n", "| red | wet | 1.0 | 11 | \n", "\n", "\n" ], "text/plain": [ " b1 b2 v1 v2\n", "1 1 95 5.0 55\n", "2 2 95 7.0 77\n", "3 1 99 5.5 55\n", "4 2 99 NA NA\n", "5 big damp 3.0 33\n", "6 blue dry 3.0 33\n", "7 red red 4.0 44\n", "8 red wet 1.0 11" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This example is adapted from R's official help page for aggregate()\n", "df <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,6,7,9),\n", " v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )\n", "by1 <- c(\"red\", \"blue\", 1, 2, NA, \"big\", 1, 2, \"red\", 1, NA, 12)\n", "by2 <- c(\"wet\", \"dry\", 99, 95, NA, \"damp\", 95, 99, \"red\", 99, NA, NA)\n", "\n", "# Aggregate the original data df by by1 and by2\n", "# Note that (by1, by2)=(1, 99) matches two records: (v1, v2)=(5, 55) and (6, 55)\n", "# Hence v1 in the third row is mean(c(5, 6)) = 5.5\n", "aggregate(x = df, by = list(b1=by1, b2=by2), FUN = \"mean\")" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
by1by2V1
1 95 5.0
2 95 7.0
1 99 5.5
big damp3.0
bluedry 3.0
red red 4.0
red wet 1.0
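\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " by1 & by2 & V1\\\\\n", "\\hline\n", "\t 1 & 95 & 5.0 \\\\\n", "\t 2 & 95 & 7.0 \\\\\n", "\t 1 & 99 & 5.5 \\\\\n", "\t big & damp & 3.0 \\\\\n", "\t blue & dry & 3.0 \\\\\n", "\t red & red & 4.0 \\\\\n", "\t red & wet & 1.0 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "by1 | by2 | V1 | \n", "|---|---|---|---|---|---|---|\n", "| 1 | 95 | 5.0 | \n", "| 2 | 95 | 7.0 | \n", "| 1 | 99 | 5.5 | \n", "| big | damp | 3.0 | \n", "| blue | dry | 3.0 | \n", "| red | red | 4.0 | \n", "| red | wet | 1.0 | \n", "\n", "\n" ], "text/plain": [ " by1 by2 V1 \n", "1 1 95 5.0\n", "2 2 95 7.0\n", "3 1 99 5.5\n", "4 big damp 3.0\n", "5 blue dry 3.0\n", "6 red red 4.0\n", "7 red wet 1.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Use a formula to pick the columns of the original data to aggregate\n", "# Note: the observation whose v1 is NA is removed\n", "aggregate(cbind(df$v1) ~ by1+by2, FUN = \"mean\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a powerful reshaping package, reshape2, which is not covered here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Functional programming\n", "\n", "Functional programming matters in every scientific-computing language. The order of preference when implementing an operation is **vectorized operations (e.g. df+1) first, then functional style, and loops only as a last resort**. In R, functional programming is mainly carried by the apply family of functions, which includes:\n", "\n", "- apply(): applies a function along a chosen margin of a matrix or data frame.\n", "- tapply(): applies a function within the groups defined by a factor.\n", "- vapply(): like sapply(), with a declared return type.\n", "- lapply(): applies a function to each element of a list and returns a list.\n", "- sapply(): like lapply(), but simplifies the result where possible.\n", "- mapply(): applies a function over several inputs in parallel.\n", "- rapply(): a recursive version of lapply().\n", "- eapply(): applies a function to the objects in an environment.\n", "\n", "The commonly used ones are introduced below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### apply(): along an axis of a multidimensional object\n", "\n", "In R, apply() applies a function to a multidimensional object. The basic syntax is:\n", "\n", " apply(d, N, FUN, ...)\n", "\n", "Here N specifies the dimension of the data d along which FUN is applied (1 for rows, 2 for columns). Additional arguments to FUN can be passed through the ellipsis." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "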
xyzs
1583
2467
3294
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " x & y & z & s\\\\\n", "\\hline\n", "\t 1 & 5 & 8 & 3\\\\\n", "\t 2 & 4 & 6 & 7\\\\\n", "\t 3 & 2 & 9 & 4\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "x | y | z | s | \n", "|---|---|---|\n", "| 1 | 5 | 8 | 3 | \n", "| 2 | 4 | 6 | 7 | \n", "| 3 | 2 | 9 | 4 | \n", "\n", "\n" ], "text/plain": [ " x y z s\n", "1 1 5 8 3\n", "2 2 4 6 7\n", "3 3 2 9 4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df <- data.frame(x=c(1, 2, 3), y=c(5, 4, 2), z=c(8, 6, 9), s=c(3, 7, 4))\n", "df" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1]]\n", "x y z s \n", "2 4 8 4 \n", "\n", "[[2]]\n", "[1] 2.50 3.50 2.75\n", "\n" ] } ], "source": [ "# Median of each column of df\n", "colmed <- apply(df, 2, median)\n", "# 25% quantile of each row of df\n", "rowquan <- apply(df, 1, quantile, probs=.25)\n", "\n", "print(list(colmed, rowquan))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### lapply(): applying over a list\n", "\n", "lapply() is designed to operate on list objects, and it returns a list." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
$a
\n", "\t\t
1
\n", "\t
$b
\n", "\t\t
5
\n", "\t
$c
\n", "\t\t
25
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$a] 1\n", "\\item[\\$b] 5\n", "\\item[\\$c] 25\n", "\\end{description}\n" ], "text/markdown": [ "$a\n", ": 1\n", "$b\n", ": 5\n", "$c\n", ": 25\n", "\n", "\n" ], "text/plain": [ "$a\n", "[1] 1\n", "\n", "$b\n", "[1] 5\n", "\n", "$c\n", "[1] 25\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "lst <- list(a=c(0,1), b=c(1,2), c=c(3,4))\n", "lapply(lst, function(x) {sum(x^2)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It can equally be applied to the columns of a data frame, because a data frame is essentially a list of its columns:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
$x
\n", "\t\t
6
\n", "\t
$y
\n", "\t\t
11
\n", "\t
$z
\n", "\t\t
23
\n", "\t
$s
\n", "\t\t
14
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$x] 6\n", "\\item[\\$y] 11\n", "\\item[\\$z] 23\n", "\\item[\\$s] 14\n", "\\end{description}\n" ], "text/markdown": [ "$x\n", ": 6\n", "$y\n", ": 11\n", "$z\n", ": 23\n", "$s\n", ": 14\n", "\n", "\n" ], "text/plain": [ "$x\n", "[1] 6\n", "\n", "$y\n", "[1] 11\n", "\n", "$z\n", "[1] 23\n", "\n", "$s\n", "[1] 14\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "lapply(df, sum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### sapply()/vapply(): variants of lapply()\n", "\n", "sapply() is essentially a variant of lapply() whose return value can be simplified to a vector instead of a list." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "'numeric'" ], "text/latex": [ "'numeric'" ], "text/markdown": [ "'numeric'" ], "text/plain": [ "[1] \"numeric\"" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "'list'" ], "text/latex": [ "'list'" ], "text/markdown": [ "'list'" ], "text/plain": [ "[1] \"list\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "class(sapply(lst, function(x) {sum(x^2)}))\n", "class(lapply(lst, function(x) {sum(x^2)}))" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y z s \n", " 6 11 23 14 \n" ] } ], "source": [ "print(sapply(df, sum))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument simplify=TRUE is the default and means the result is simplified to a vector (or matrix) where possible instead of a list. With simplify=FALSE the result is a list, as with lapply()." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
$x
\n", "\t\t
6
\n", "\t
$y
\n", "\t\t
11
\n", "\t
$z
\n", "\t\t
23
\n", "\t
$s
\n", "\t\t
14
\n", "
\n" ], "text/latex": [ "\\begin{description}\n", "\\item[\\$x] 6\n", "\\item[\\$y] 11\n", "\\item[\\$z] 23\n", "\\item[\\$s] 14\n", "\\end{description}\n" ], "text/markdown": [ "$x\n", ": 6\n", "$y\n", ": 11\n", "$z\n", ": 23\n", "$s\n", ": 14\n", "\n", "\n" ], "text/plain": [ "$x\n", "[1] 6\n", "\n", "$y\n", "[1] 11\n", "\n", "$z\n", "[1] 23\n", "\n", "$s\n", "[1] 14\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sapply(df, sum, simplify=FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "vapply() takes a FUN.VALUE argument, a template that fixes the type and length of each value returned by FUN; when the template is named and has more than one element, its names become the row names of the result. The naming step can often be done instead with lapply()/sapply() followed by a separate call to row.names()." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### mapply(): applying a function over multiple inputs\n", "\n", "mapply() accepts multiple inputs:\n", "\n", "    mapply(FUN, [input1, input2, ...], MoreArgs=NULL)\n", "    \n", "The inputs **should have equal lengths, or the longer length should be a whole multiple of the shorter**, because shorter inputs are recycled. The benefit of this function is that the data do not have to be combined beforehand." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [1] -2.0 -1.0 0.0 1.0 2.0 0.0 0.5 1.0 1.5 2.0\n" ] } ], "source": [ "print(mapply(min, seq(0, 2, by=0.5), -2:7))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### tapply(): grouped application\n", "\n", "tapply() splits the data into groups by the levels of a factor and then applies a function to each group, similar to a GROUP BY operation:\n", "\n", "    tapply(X, idx, FUN)\n", "\n", "Here X is the data and idx is the grouping variable." ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "$a\n", "[1] 1 4 9\n", "\n", "$b\n", "[1] 2 6 12\n", "\n" ] } ], "source": [ "df <- data.frame(x=1:6, groups=rep(c(\"a\", \"b\"), 3))\n", "print(tapply(df$x, df$groups, cumsum))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The remaining apply-family functions are rarely used and are not covered here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other useful functions\n", "\n", "The [Reading and Writing Data](ReadData.ipynb) article in this series also introduces some useful functions and is worth consulting.\n", "\n", "In addition:\n", "\n", "| Function | Meaning |\n", "| --- | --- |\n", "| seq(from=N, to=N, by=N, [length.out=N, along.with=obj]) | 
Generates a sequence. The arguments are the start, the end, the step, the desired length of the sequence, and an object whose length the sequence should match. |\n", "| rep(x, N) | Repeats values. For example, rep(1:2, 2) produces the vector c(1, 2, 1, 2). |\n", "| cut(x, N, [ordered_result=FALSE]) | Cuts into a factor. Divides the continuous variable x into a factor with N levels; ordered_result controls whether the factor is ordered. |\n", "| pretty(x, N) | Pretty breakpoints. Divides the range of x into roughly N intervals (roughly N+1 endpoints) whose endpoints are round values. Used in plotting. |\n", "| cat(obj1, obj2, ..., [file=, append=]) | Concatenates multiple objects and prints them to the screen or writes them to a file. |" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.0.0" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }