{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Descriptive Statistics in R\n", "\n", "This notebook covers the basics of descriptive statistics in R." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive statistics\n", "\n", "The simplest tool is summary(). For numeric variables it reports the minimum, maximum, quartiles, and median (together the five-number summary, which fivenum() returns on its own) as well as the mean; for factors it reports the count of each level. fivenum() uses Tukey's hinges, so on some data its second and fourth values differ slightly from the quartiles that summary() reports." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/latex": [ "\\begin{tabular}{r|lll}\n", " x & y & f\\\\\n", "\\hline\n", "\t 1 & 2 & 1\\\\\n", "\t 2 & 3 & 2\\\\\n", "\t 3 & 4 & 3\\\\\n", "\t 2 & 3 & 1\\\\\n", "\t 3 & 4 & 2\\\\\n", "\t 4 & 5 & 3\\\\\n", "\t 4 & 5 & 1\\\\\n", "\t 5 & 6 & 2\\\\\n", "\t 6 & 7 & 3\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| x | y | f | \n", "|---|---|---|\n", "| 1 | 2 | 1 | \n", "| 2 | 3 | 2 | \n", "| 3 | 4 | 3 | \n", "| 2 | 3 | 1 | \n", "| 3 | 4 | 2 | \n", "| 4 | 5 | 3 | \n", "| 4 | 5 | 1 | \n", "| 5 | 6 | 2 | \n", "| 6 | 7 | 3 | \n", "\n", "\n" ], "text/plain": [ " x y f\n", "1 1 2 1\n", "2 2 3 2\n", "3 3 4 3\n", "4 2 3 1\n", "5 3 4 2\n", "6 4 5 3\n", "7 4 5 1\n", "8 5 6 2\n", "9 6 7 3" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Sample data: two numeric columns and a three-level factor\n", "dt <- data.frame(x=c(seq(1, 3), seq(2, 4), seq(4, 6)))\n", "dt$y <- dt$x + 1\n", "dt$f <- as.factor(rep(c(1, 2, 3), 3))\n", "dt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " x y f \n", " Min. :1.000 Min. :2.000 1:3 \n", " 1st Qu.:2.000 1st Qu.:3.000 2:3 \n", " Median :3.000 Median :4.000 3:3 \n", " Mean :3.333 Mean :4.333 \n", " 3rd Qu.:4.000 3rd Qu.:5.000 \n", " Max. :6.000 Max. 
:7.000 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "summary(dt)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 1 2 3 4 6\n" ] } ], "source": [ "print(fivenum(dt$x))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Another notebook](ManageData.ipynb#整合:aggregate()) shows how to compute grouped statistics on a data frame with aggregate(). That function is most convenient with statistics that return a single value; for functions that return several values, use by():" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dt$f: 1\n", " x y\n", "Min 1 2\n", "Max 4 5\n", "------------------------------------------------------------ \n", "dt$f: 2\n", " x y\n", "Min 2 3\n", "Max 5 6\n", "------------------------------------------------------------ \n", "dt$f: 3\n", " x y\n", "Min 3 4\n", "Max 6 7" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mystats <- function(x) {\n", " return(c(Min=min(x), Max=max(x)))\n", "}\n", "# Minimum and maximum of columns x and y within each level of f\n", "by(dt[,c(\"x\",\"y\")], dt$f, function(data) sapply(data, mystats))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Frequency tables\n", "\n", "table() counts how often each value occurs, and prop.table() converts such a table of counts into proportions:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1]]\n", "\n", "1 2 3 4 5 6 \n", "1 2 2 2 1 1 \n", "\n", "[[2]]\n", "\n", "   1    2    3    4    5    6 \n", "0.11 0.22 0.22 0.22 0.11 0.11 \n", "\n" ] } ], "source": [ "# prop.table() expects a table of counts, so wrap dt$x in table() first\n", "print(list(table(dt$x), round(prop.table(table(dt$x)), digits=2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To build a two-way contingency table, use table(A, B) or xtabs(~ A + B, data=), where A supplies the rows and B the columns." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " y\n", "x 2 3 4\n", " 1 1 0 0\n", " 2 0 1 1\n", " 3 1 1 0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "tmp <- data.frame(x=c(1, 2, 2, 3, 3), y=c(2, 3, 4, 3, 2))\n", "ct 
<- xtabs(~x+y, data=tmp)\n", "ct" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/latex": [ "\\begin{tabular}{r|llll}\n", " & 2 & 3 & 4 & Sum\\\\\n", "\\hline\n", "\t1 & 1 & 0 & 0 & 1\\\\\n", "\t2 & 0 & 1 & 1 & 2\\\\\n", "\t3 & 1 & 1 & 0 & 2\\\\\n", "\tSum & 2 & 2 & 1 & 5\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | 2 | 3 | 4 | Sum | \n", "|---|---|---|---|---|\n", "| 1 | 1 | 0 | 0 | 1 | \n", "| 2 | 0 | 1 | 1 | 2 | \n", "| 3 | 1 | 1 | 0 | 2 | \n", "| Sum | 2 | 2 | 1 | 5 | \n", "\n", "\n" ], "text/plain": [ " y\n", "x 2 3 4 Sum\n", " 1 1 0 0 1 \n", " 2 0 1 1 2 \n", " 3 1 1 0 2 \n", " Sum 2 2 1 5 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Add marginal sums to the contingency table.\n", "# addmargins(ct, 1) adds only the row of column totals; addmargins(ct, 2) adds only the column of row totals.\n", "addmargins(ct)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Marginal frequencies are computed with margin.table() and proportions with prop.table(). In both functions, a margin argument of 1 refers to rows and 2 to columns." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "x\n", "1 2 3 \n", "1 2 2 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Row totals (sum across each row)\n", "margin.table(ct, 1)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " y\n", "x 2 3 4\n", " 1 1.0 0.0 0.0\n", " 2 0.0 0.5 0.5\n", " 3 0.5 0.5 0.0" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Row proportions (each row sums to 1)\n", "prop.table(ct, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Independence test: chi-squared\n", "\n", "This section applies the chi-squared $\\chi^2$ test of independence. In this example p = 0.44 > 0.05, so the null hypothesis of independence cannot be rejected: the data give no evidence that x and y are associated. That is weaker than showing they are independent. chisq.test() also warns that the chi-squared approximation may be incorrect, because the expected counts in such a small table are very low, so the p-value should be read with caution." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning message in chisq.test(ct):\n", "\"Chi-squared approximation may be incorrect\"" ] }, { "data": { "text/plain": [ "\n", "\tPearson's Chi-squared test\n", "\n", "data: ct\n", "X-squared = 3.75, df = 4, p-value = 0.4409\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "chisq.test(ct)" ] } ], "metadata": { "kernelspec": { "display_name": 
"R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.0.0" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }