{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# R中的描述性统计\n",
"\n",
"本文展示了 R 语言中基础的描述性统计相关的内容。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 描述性统计\n",
"\n",
"最简单的是 summary() 函数,给出数值变量的的最值、四分位值、中位数(这五个又称为五位数总括,可以用 fivenum() 函数单独调用),以及均值;非数值变量的频数统计。"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"x | y | f |
\n",
"\n",
"\t1 | 2 | 1 |
\n",
"\t2 | 3 | 2 |
\n",
"\t3 | 4 | 3 |
\n",
"\t2 | 3 | 1 |
\n",
"\t3 | 4 | 2 |
\n",
"\t4 | 5 | 3 |
\n",
"\t4 | 5 | 1 |
\n",
"\t5 | 6 | 2 |
\n",
"\t6 | 7 | 3 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" x & y & f\\\\\n",
"\\hline\n",
"\t 1 & 2 & 1\\\\\n",
"\t 2 & 3 & 2\\\\\n",
"\t 3 & 4 & 3\\\\\n",
"\t 2 & 3 & 1\\\\\n",
"\t 3 & 4 & 2\\\\\n",
"\t 4 & 5 & 3\\\\\n",
"\t 4 & 5 & 1\\\\\n",
"\t 5 & 6 & 2\\\\\n",
"\t 6 & 7 & 3\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"x | y | f | \n",
"|---|---|---|---|---|---|---|---|---|\n",
"| 1 | 2 | 1 | \n",
"| 2 | 3 | 2 | \n",
"| 3 | 4 | 3 | \n",
"| 2 | 3 | 1 | \n",
"| 3 | 4 | 2 | \n",
"| 4 | 5 | 3 | \n",
"| 4 | 5 | 1 | \n",
"| 5 | 6 | 2 | \n",
"| 6 | 7 | 3 | \n",
"\n",
"\n"
],
"text/plain": [
" x y f\n",
"1 1 2 1\n",
"2 2 3 2\n",
"3 3 4 3\n",
"4 2 3 1\n",
"5 3 4 2\n",
"6 4 5 3\n",
"7 4 5 1\n",
"8 5 6 2\n",
"9 6 7 3"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"dt <- data.frame(x=c(seq(1, 3), seq(2, 4), seq(4, 6))) \n",
"dt$y <- dt$x + 1\n",
"dt$f <- as.factor(rep(c(1, 2, 3), 3))\n",
"dt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" x y f \n",
" Min. :1.000 Min. :2.000 1:3 \n",
" 1st Qu.:2.000 1st Qu.:3.000 2:3 \n",
" Median :3.000 Median :4.000 3:3 \n",
" Mean :3.333 Mean :4.333 \n",
" 3rd Qu.:4.000 3rd Qu.:5.000 \n",
" Max. :6.000 Max. :7.000 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"summary(dt)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] 1 2 3 4 6\n"
]
}
],
"source": [
"print(fivenum(dt$x))"
]
},
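{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch, the values that summary() reports can also be computed one at a time with base R functions. The vector x is re-created here so the snippet stands alone, and the variable names are illustrative only. quantile() is called with its default settings, which are assumed to be the same ones summary() uses.\n",
"\n",
"```r\n",
"# Same values as dt$x above, re-created so the snippet is self-contained\n",
"x <- c(1, 2, 3, 2, 3, 4, 4, 5, 6)\n",
"\n",
"# Individual statistics that summary() reports together\n",
"x_min <- min(x)\n",
"x_max <- max(x)\n",
"x_median <- median(x)\n",
"x_mean <- mean(x)\n",
"x_quartiles <- quantile(x, probs = c(0.25, 0.75))\n",
"\n",
"# Spread measures that summary() does not report\n",
"x_sd <- sd(x)\n",
"x_range <- range(x)\n",
"\n",
"print(c(Min = x_min, Q1 = unname(x_quartiles[1]), Median = x_median,\n",
"        Mean = x_mean, Q3 = unname(x_quartiles[2]), Max = x_max))\n",
"```\n",
"\n",
"Calling the functions separately is handy when only one statistic is needed, or when a measure such as the standard deviation is wanted alongside the summary."
]
},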
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在 [这篇文章](ManageData.ipynb#整合:aggregate()) 中介绍了利用 aggregate() 函数对二维数据进行分组统计的方法。不过该函数只能调用单返回值的统计函数,如果要调用多返回值的,请使用 by() 函数:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dt$f: 1\n",
" x y\n",
"Min 1 2\n",
"Max 4 5\n",
"------------------------------------------------------------ \n",
"dt$f: 2\n",
" x y\n",
"Min 2 3\n",
"Max 5 6\n",
"------------------------------------------------------------ \n",
"dt$f: 3\n",
" x y\n",
"Min 3 4\n",
"Max 6 7"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"mystats <- function(x) {\n",
" return(c(Min=min(x), Max=max(x)))\n",
"}\n",
"# 分别统计列 x 与 列 y 中在 f 列各水平下的最值\n",
"by(dt[,c(\"x\",\"y\")], dt$f, function(data) sapply(data, mystats))"
]
},
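{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, here is a minimal sketch of the single-value case with aggregate(), using the same example data (re-created so the snippet stands alone). The choice of mean() as the statistic and the name agg are illustrative assumptions, not taken from the linked article.\n",
"\n",
"```r\n",
"# Re-create the example data so the snippet is self-contained\n",
"dt <- data.frame(x = c(1:3, 2:4, 4:6))\n",
"dt$y <- dt$x + 1\n",
"dt$f <- as.factor(rep(c(1, 2, 3), 3))\n",
"\n",
"# One value per group and column: the mean of x and y within each level of f\n",
"agg <- aggregate(cbind(x, y) ~ f, data = dt, FUN = mean)\n",
"print(agg)\n",
"```\n",
"\n",
"The result is an ordinary data frame with one row per level of f, which is easier to merge or plot than the list-like object that by() returns."
]
},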
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 频数表\n",
"\n",
"函数 table() 与 prop.table() 分别统计频数或频率:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1]]\n",
"\n",
"1 2 3 4 5 6 \n",
"1 2 2 2 1 1 \n",
"\n",
"[[2]]\n",
"[1] 0.03 0.07 0.10 0.07 0.10 0.13 0.13 0.17 0.20\n",
"\n"
]
}
],
"source": [
"print(list(table(dt$x), round(prop.table(dt$x), digits=2)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"此外,如果要制作二维列联表,使用 table(A, B) 或者 xtabs(~ A + B, data=) 函数。其中 A 是行, B是列。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" y\n",
"x 2 3 4\n",
" 1 1 0 0\n",
" 2 0 1 1\n",
" 3 1 1 0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"tmp <- data.frame(x=c(1, 2, 2, 3, 3), y=c(2, 3, 4, 3, 2))\n",
"ct <- xtabs(~x+y, data=tmp)\n",
"ct"
]
},
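{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two interfaces can be compared side by side. The sketch below rebuilds the same small data frame and checks that table() and xtabs() yield the same counts; the names ct_table and ct_xtabs are illustrative only.\n",
"\n",
"```r\n",
"tmp <- data.frame(x = c(1, 2, 2, 3, 3), y = c(2, 3, 4, 3, 2))\n",
"\n",
"# table() takes the vectors directly; dnn labels the two dimensions\n",
"ct_table <- table(tmp$x, tmp$y, dnn = c('x', 'y'))\n",
"# xtabs() takes a formula and a data frame\n",
"ct_xtabs <- xtabs(~ x + y, data = tmp)\n",
"\n",
"# Both hold the same counts, cell by cell\n",
"print(all(as.vector(ct_table) == as.vector(ct_xtabs)))\n",
"```\n",
"\n",
"The formula interface tends to be more convenient when the variables live in a data frame, since the dimension labels are picked up from the column names."
]
},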
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" | 2 | 3 | 4 | Sum |
\n",
"\n",
"\t1 | 1 | 0 | 0 | 1 |
\n",
"\t2 | 0 | 1 | 1 | 2 |
\n",
"\t3 | 1 | 1 | 0 | 2 |
\n",
"\tSum | 2 | 2 | 1 | 5 |
\n",
"\n",
"
\n"
],
"text/latex": [
"\\begin{tabular}{r|llll}\n",
" & 2 & 3 & 4 & Sum\\\\\n",
"\\hline\n",
"\t1 & 1 & 0 & 0 & 1\\\\\n",
"\t2 & 0 & 1 & 1 & 2\\\\\n",
"\t3 & 1 & 1 & 0 & 2\\\\\n",
"\tSum & 2 & 2 & 1 & 5\\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| | 2 | 3 | 4 | Sum | \n",
"|---|---|---|---|\n",
"| 1 | 1 | 0 | 0 | 1 | \n",
"| 2 | 0 | 1 | 1 | 2 | \n",
"| 3 | 1 | 1 | 0 | 2 | \n",
"| Sum | 2 | 2 | 1 | 5 | \n",
"\n",
"\n"
],
"text/plain": [
" y\n",
"x 2 3 4 Sum\n",
" 1 1 0 0 1 \n",
" 2 0 1 1 2 \n",
" 3 1 1 0 2 \n",
" Sum 2 2 1 5 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 为列联表添加边际和;也可以通过 addmargins(ct, 1/2) 只累加列/行\n",
"addmargins(ct)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"边际频数使用 margin.table() 进行计算,比例使用 prop.table() 进行计算。参数 1 表示行,2 表示列。"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"x\n",
"1 2 3 \n",
"1 2 2 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 行内总计(每行累加)\n",
"margin.table(ct, 1)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" y\n",
"x 2 3 4\n",
" 1 1.0 0.0 0.0\n",
" 2 0.0 0.5 0.5\n",
" 3 0.5 0.5 0.0"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# 行内比例(每行累加为 1)\n",
"prop.table(ct, 1)"
]
},
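{
"cell_type": "markdown",
"metadata": {},
"source": [
"The column direction works the same way with margin 2, and omitting the margin gives proportions over the whole table. A minimal sketch on the same contingency table, re-created so the snippet stands alone (the variable names are illustrative):\n",
"\n",
"```r\n",
"tmp <- data.frame(x = c(1, 2, 2, 3, 3), y = c(2, 3, 4, 3, 2))\n",
"ct <- xtabs(~ x + y, data = tmp)\n",
"\n",
"# Column totals (sum down each column)\n",
"col_totals <- margin.table(ct, 2)\n",
"# Column proportions (each column sums to 1)\n",
"col_props <- prop.table(ct, 2)\n",
"# Cell proportions over the whole table (all cells sum to 1)\n",
"cell_props <- prop.table(ct)\n",
"\n",
"print(col_totals)\n",
"print(col_props)\n",
"print(cell_props)\n",
"```\n",
"\n",
"Which margin to use depends on the question: row proportions describe the distribution of y within each x, column proportions the distribution of x within each y."
]
},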
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 独立性检验:卡方\n",
"\n",
"这里介绍卡方 $\\chi^2$ 独立性检验。本例中,p = 0.44 > 0.05,接受了相互独立的假设,即认为它们是独立的。"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Warning message in chisq.test(ct):\n",
"\"Chi-squared approximation may be incorrect\""
]
},
{
"data": {
"text/plain": [
"\n",
"\tPearson's Chi-squared test\n",
"\n",
"data: ct\n",
"X-squared = 3.75, df = 4, p-value = 0.4409\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"chisq.test(ct)"
]
}
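,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The warning above can be investigated by looking at the expected counts. A common rule of thumb (an assumption here, not a strict requirement) is that the chi-squared approximation becomes unreliable when expected counts fall below about 5. The sketch below prints the expected counts and, as one possible alternative, runs Fisher's exact test from base R on the same table; the variable names are illustrative.\n",
"\n",
"```r\n",
"tmp <- data.frame(x = c(1, 2, 2, 3, 3), y = c(2, 3, 4, 3, 2))\n",
"ct <- xtabs(~ x + y, data = tmp)\n",
"\n",
"# Expected counts under independence; the warning is suppressed because\n",
"# the small counts are exactly what is being inspected here\n",
"expected <- suppressWarnings(chisq.test(ct))$expected\n",
"print(expected)\n",
"\n",
"# Fisher's exact test does not rely on the chi-squared approximation\n",
"fisher_res <- fisher.test(ct)\n",
"print(fisher_res$p.value)\n",
"```\n",
"\n",
"With every expected count well below 5, neither test can say much about a sample this small; the example is mainly useful for showing the mechanics."
]
}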
],
"metadata": {
"kernelspec": {
"display_name": "R",
"language": "R",
"name": "ir"
},
"language_info": {
"codemirror_mode": "r",
"file_extension": ".r",
"mimetype": "text/x-r-source",
"name": "R",
"pygments_lexer": "r",
"version": "4.0.0"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}