What Is Statistics? Descriptive, Inferential, Probability, and the Science of Data

A comprehensive introduction to statistics — descriptive vs. inferential statistics, probability and distributions, hypothesis testing, p-values, confidence intervals, correlation vs. causation, common statistical errors, and why statistical literacy is essential for understanding research and data.

The InfoNexus Editorial Team · May 3, 2026 · 9 min read

What Is Statistics?

Statistics is the science of collecting, organizing, analyzing, and interpreting numerical data to draw conclusions about real-world phenomena. It provides the mathematical tools to extract meaningful patterns from data, quantify uncertainty, and make evidence-based decisions when certainty is impossible.

Statistics permeates modern life: every poll, clinical drug trial, economic report, sports performance metric, actuarial calculation, and scientific finding relies on statistical methods. Statistical literacy — the ability to interpret and critically evaluate statistical claims — has become an essential skill for navigating a data-saturated world and evaluating the flood of studies, polls, and data visualizations in daily news.

Descriptive Statistics: Summarizing Data

Descriptive statistics summarize and describe the main features of a dataset without making inferences beyond the data at hand.

Measures of Central Tendency

  • Mean (arithmetic average): Sum of values divided by count. Sensitive to outliers — a single extreme value can dramatically shift the mean. The mean U.S. household income (~$105,000) is substantially higher than median household income (~$74,000) because a few very high-income households pull the mean upward.
  • Median: The middle value when data are sorted; the 50th percentile. Resistant to outliers; preferred for skewed distributions (income, housing prices).
  • Mode: The most frequently occurring value. Useful for categorical data (most common blood type, most popular car color).
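
To make the outlier sensitivity concrete, here is a minimal sketch using Python's built-in statistics module; the income figures are invented for illustration (they are not the census figures quoted above).

```python
# Minimal sketch: an outlier pulls the mean but not the median (invented figures).
from statistics import mean, median, mode

incomes = [42_000, 48_000, 55_000, 61_000, 74_000, 89_000, 2_500_000]  # one extreme earner

print(f"mean:   {mean(incomes):>12,.0f}")    # pulled far upward by the outlier
print(f"median: {median(incomes):>12,.0f}")  # still the middle value: 61,000

colors = ["blue", "red", "blue", "green", "blue"]
print("mode:", mode(colors))                 # the mode also works for categorical data
```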

Measures of Spread

  • Range: Maximum minus minimum. Simple but sensitive to outliers.
  • Standard deviation (SD): Roughly, the typical distance of data points from the mean (formally, the square root of the average squared deviation); the most common measure of spread. For a normal distribution, approximately 68% of data falls within 1 SD of the mean, 95% within 2 SDs, and 99.7% within 3 SDs.
  • Interquartile range (IQR): The range of the middle 50% of data (25th to 75th percentile). Resistant to outliers; used in box plots.
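
The 68/95/99.7 rule and the SD/IQR comparison can be checked empirically. The following is a small sketch on simulated normal data, assuming NumPy is available.

```python
# Sketch: range, standard deviation, IQR, and the 68/95/99.7 rule on simulated normal data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=100_000)    # mean 100, SD 15

sd = data.std()
q1, q3 = np.percentile(data, [25, 75])
print(f"range: {data.max() - data.min():.1f}")        # sensitive to the most extreme points
print(f"SD:    {sd:.1f}")
print(f"IQR:   {q3 - q1:.1f}  (25th to 75th percentile)")

for k in (1, 2, 3):
    within = np.mean(np.abs(data - data.mean()) <= k * sd)
    print(f"within {k} SD of the mean: {within:.1%}")  # ≈ 68%, 95%, 99.7%
```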

Data Visualization

Statistical visualization is fundamental to understanding data: histograms show the distribution shape; scatter plots reveal relationships between two variables; box plots display central tendency, spread, and outliers; bar charts compare categories. Edward Tufte's work on data visualization has emphasized maximizing the data-to-ink ratio — showing the most information with the least visual clutter.
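As a rough illustration, the sketch below draws each of these four plot types on synthetic data with matplotlib; the data and figure layout are arbitrary.

```python
# Sketch: the four basic plot types on synthetic data (NumPy and matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)                     # a noisy linear relationship
counts = {"A": 23, "B": 41, "C": 17}                 # made-up category counts

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].hist(x, bins=30)                          # histogram: distribution shape
axes[0, 0].set_title("Histogram")
axes[0, 1].scatter(x, y, s=5)                        # scatter: relationship between variables
axes[0, 1].set_title("Scatter plot")
axes[1, 0].boxplot(x)                                # box plot: center, spread, outliers
axes[1, 0].set_title("Box plot")
axes[1, 1].bar(list(counts), list(counts.values()))  # bar chart: category comparison
axes[1, 1].set_title("Bar chart")
plt.tight_layout()
plt.show()
```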

Probability: The Foundation of Inference

Statistics is built on probability theory — the mathematical framework for quantifying uncertainty. Probability ranges from 0 (impossible) to 1 (certain). Key concepts:

  • Probability distributions: Functions describing the likelihood of different outcomes. The normal (Gaussian) distribution — the bell curve — appears widely in nature and statistics because of the Central Limit Theorem: the mean of a sufficiently large random sample is approximately normally distributed across repeated samples, regardless of the shape of the underlying distribution (provided its variance is finite).
  • Conditional probability: The probability of event A given that event B has occurred (P(A|B)). Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B). Bayesian reasoning — updating beliefs in light of evidence — is a core statistical framework increasingly used in machine learning and clinical medicine.
  • The law of large numbers: As sample size increases, sample statistics converge to population parameters. Explains why large clinical trials are more reliable than small ones.
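
The sketch below simulates two of these ideas with NumPy: the Central Limit Theorem applied to a skewed distribution, and Bayes' theorem applied to a hypothetical diagnostic test (the prevalence, sensitivity, and false-positive rate are invented).

```python
# Sketch: Central Limit Theorem and Bayes' theorem with made-up numbers (NumPy assumed).
import numpy as np

rng = np.random.default_rng(42)

# Central Limit Theorem / law of large numbers: the exponential distribution is
# strongly right-skewed (true mean 1.0), yet means of samples of size 50 cluster
# tightly and symmetrically around 1.0.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
print(f"average of 10,000 sample means: {sample_means.mean():.3f}  (true mean 1.0)")
print(f"spread (SD) of the sample means: {sample_means.std():.3f}  (≈ 1/sqrt(50))")

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive).
# Hypothetical test: 1% prevalence, 99% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.99
p_pos_given_healthy = 0.05
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(f"P(disease | positive) = {p_pos_given_disease * p_disease / p_pos:.2f}")  # ≈ 0.17, not 0.99
```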

Inferential Statistics: Drawing Conclusions from Samples

Inferential statistics uses sample data to make inferences about a population — with quantified uncertainty. Because we rarely have access to an entire population (every U.S. voter, every patient with a disease), we draw samples and use statistical methods to estimate population parameters and test hypotheses.

Hypothesis Testing

Hypothesis testing is the formal framework for evaluating whether observed data provide evidence against a null hypothesis:

  1. Null hypothesis (H₀): The default assumption — typically that there is no effect, no difference, or no relationship
  2. Alternative hypothesis (H₁): The claim being tested (e.g., the drug reduces blood pressure)
  3. Test statistic: A number calculated from sample data that measures how far the sample is from what's expected under H₀
  4. P-value: The probability of observing a test statistic at least as extreme as the one calculated, assuming H₀ is true. A small p-value (typically < 0.05) suggests the data are unlikely under H₀ — evidence against it.
  5. Decision: Reject H₀ (if p < α threshold) or fail to reject H₀
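
A minimal sketch of these five steps, using SciPy's two-sample t-test on simulated blood-pressure data; the groups, sample sizes, and effect size are invented for illustration.

```python
# Sketch: a hypothesis test on simulated blood-pressure data (NumPy and SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
control = rng.normal(loc=140, scale=12, size=60)   # placebo group, systolic BP (mmHg)
treated = rng.normal(loc=134, scale=12, size=60)   # drug group: simulated true effect of -6 mmHg

# H0: both groups share the same mean.  H1: the means differ.
t_stat, p_value = stats.ttest_ind(treated, control)   # test statistic and p-value
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05                                           # decision threshold
print("reject H0" if p_value < alpha else "fail to reject H0")
```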

Understanding P-Values

The p-value is one of the most misunderstood concepts in statistics. Common misconceptions:

  • Wrong: "The p-value is the probability that the null hypothesis is true" (it is not; it is the probability of data at least as extreme as those observed, assuming the null hypothesis is true)
  • Wrong: "p < 0.05 means the result is practically important" (a trivially small effect can have p < 0.00001 with a large enough sample)
  • Wrong: "p > 0.05 means no effect exists" (absence of evidence is not evidence of absence)

The arbitrary threshold of p < 0.05 (introduced by Ronald Fisher in the 1920s as a rule of thumb, not a scientific law) has caused enormous problems in research reproducibility. The American Statistical Association issued a 2016 statement emphasizing that statistical significance is not the same as scientific importance, and that context, effect size, and prior evidence must be considered.
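
The second misconception above is easy to reproduce in simulation: with a very large sample, a practically negligible difference still produces a tiny p-value. A sketch with invented numbers:

```python
# Sketch: statistical significance without practical importance (NumPy and SciPy assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(loc=100.0, scale=15, size=500_000)
group_b = rng.normal(loc=100.2, scale=15, size=500_000)   # true difference: only 0.2 units

t_stat, p_value = stats.ttest_ind(group_a, group_b)
diff = group_b.mean() - group_a.mean()
print(f"observed difference: {diff:.2f} units (negligible)")
print(f"p-value: {p_value:.1e}  (far below 0.05 despite the tiny effect)")
```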

Confidence Intervals

A 95% confidence interval is produced by a procedure that, if the same study were repeated many times, would yield intervals containing the true parameter in 95% of repetitions. A drug that reduced blood pressure by 5 mmHg (95% CI: 2–8 mmHg) provides more information than just reporting p < 0.05 — the interval shows both the direction and plausible magnitude of the effect.
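
A sketch of computing such an interval from simulated per-patient reductions; the numbers are invented to mirror the example, and SciPy is assumed.

```python
# Sketch: 95% confidence interval for a mean blood-pressure reduction (simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
reduction = rng.normal(loc=5.0, scale=8.0, size=100)   # per-patient reduction, mmHg

mean = reduction.mean()
sem = stats.sem(reduction)                             # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(reduction) - 1, loc=mean, scale=sem)
print(f"mean reduction: {mean:.1f} mmHg")
print(f"95% CI: {ci_low:.1f} to {ci_high:.1f} mmHg")
```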

Correlation and Causation

One of statistics' most important lessons: correlation does not imply causation. A correlation coefficient (r) measures the strength of the linear relationship between two variables, ranging from -1 to +1. Ice cream sales and drowning deaths are positively correlated — both peak in summer. The link is explained by a confounding variable: hot weather increases both ice cream consumption and swimming (and therefore drownings).
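
The ice cream example can be simulated directly. In the sketch below, temperature drives both variables, producing a clear correlation that largely disappears once temperature is accounted for; all numbers are invented.

```python
# Sketch: a spurious correlation created by a confounder (NumPy assumed).
import numpy as np

rng = np.random.default_rng(5)
temperature = rng.normal(loc=20, scale=8, size=365)             # daily temperature (°C)
ice_cream = 50 + 3.0 * temperature + rng.normal(0, 10, 365)     # caused by temperature
drownings = 1 + 0.1 * temperature + rng.normal(0, 1, 365)       # also caused by temperature

r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"r(ice cream, drownings) = {r:.2f}")                      # clearly positive

# Remove the part of each variable explained by temperature; the leftover correlation ≈ 0.
def residualize(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_partial = np.corrcoef(residualize(ice_cream, temperature),
                        residualize(drownings, temperature))[0, 1]
print(f"r after adjusting for temperature = {r_partial:.2f}")    # ≈ 0
```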

Establishing causality requires either randomized controlled experiments (randomly assigning subjects to treatment and control, which balances confounders across groups) or sophisticated observational methods (instrumental variables, difference-in-differences, regression discontinuity) designed to approximate random assignment. The counterfactual framework — asking what would have happened in the absence of treatment — is the conceptual basis for causal inference.
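
A small simulation can show why randomization matters: when sicker patients are more likely to take a drug, the naive observational comparison is badly biased, while random assignment recovers the true effect. The parameters below are invented for illustration.

```python
# Sketch: confounded observational data vs. a randomized experiment (NumPy assumed).
import numpy as np

rng = np.random.default_rng(9)
n = 50_000
severity = rng.normal(0, 1, n)          # confounder: baseline illness severity
true_effect = -5.0                      # the drug genuinely lowers the outcome by 5

# Observational data: sicker patients are more likely to take the drug.
takes_drug = rng.random(n) < 1 / (1 + np.exp(-2 * severity))
outcome = 10 * severity + true_effect * takes_drug + rng.normal(0, 1, n)
naive = outcome[takes_drug].mean() - outcome[~takes_drug].mean()
print(f"observational estimate: {naive:+.1f}   (biased by confounding)")

# Randomized experiment: assignment is independent of severity.
assigned = rng.random(n) < 0.5
outcome_rct = 10 * severity + true_effect * assigned + rng.normal(0, 1, n)
rct = outcome_rct[assigned].mean() - outcome_rct[~assigned].mean()
print(f"randomized estimate:    {rct:+.1f}   (close to the true effect of -5.0)")
```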

statistics · mathematics · data science · probability