Statistical model

A statistical model is a mathematical model that helps us understand how data is created and behaves. It is built on a set of statistical assumptions about how sample data comes from a larger population. These models are simplified versions of real processes that generate data, making it easier to study and predict patterns.

When the focus is specifically on probabilities, the term probabilistic model is used instead. Every statistical hypothesis test and every statistical estimator (a rule for estimating a value from data) is derived from a statistical model. More generally, statistical models form the basis of statistical inference, which is how scientists draw conclusions from data.

A statistical model usually describes a mathematical relationship between one or more random variables, whose values are uncertain, and other variables that are not random. In this sense it is a formal way to express a theory about how things work, a description given by Herman Adèr, quoting Kenneth Bollen. These models are important tools in many fields, helping us make sense of complex information.

Introduction

A statistical model can be thought of as a set of assumptions about how data might be created. For example, imagine rolling two ordinary six-sided dice. One assumption is that, for each die, every number from 1 to 6 has the same chance of coming up, namely 1 out of 6. If we also assume that the two dice do not affect each other, we can calculate the chance of both dice showing 5: it is 1 out of 6 times 1 out of 6, or 1 out of 36.
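The dice calculation can be checked with a short Python sketch. It assumes fair, independent dice as described above; the helper name event_probability is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption: each face of a fair die has probability 1/6,
# and the two dice do not affect each other (independence).
face_prob = {face: Fraction(1, 6) for face in range(1, 7)}

def event_probability(event):
    """Add up the probabilities of all (die1, die2) outcomes where event holds."""
    return sum(
        face_prob[a] * face_prob[b]
        for a, b in product(face_prob, repeat=2)
        if event(a, b)
    )

both_fives = event_probability(lambda a, b: a == 5 and b == 5)
total_is_seven = event_probability(lambda a, b: a + b == 7)
print(both_fives)      # 1/36
print(total_is_seven)  # 1/6
```

Because the assumption fixes the chance of every face, the same helper can answer a question about any event, not only the one about two fives.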

Another assumption might say that, for each die, the number 5 has a chance of 1 out of 8 because the dice are weighted. But this assumption alone isn’t enough to calculate the probability of every event, since it says nothing about the chances of the other numbers. The first assumption gives us a statistical model because it lets us calculate the probability of any event, even if the calculation is sometimes very hard. The second assumption on its own does not.
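To see why the weighted-dice assumption is incomplete, the sketch below fills in the missing chances in two different ways. Both tables are invented for illustration; the only thing they share is that 5 has a chance of 1 out of 8.

```python
from fractions import Fraction

# Two made-up ways to complete the assumption P(5) = 1/8.
# Each table adds up to 1, so each is a valid set of chances for one die.
completion_a = {1: Fraction(7, 40), 2: Fraction(7, 40), 3: Fraction(7, 40),
                4: Fraction(7, 40), 5: Fraction(1, 8), 6: Fraction(7, 40)}
completion_b = {1: Fraction(1, 8), 2: Fraction(1, 8), 3: Fraction(1, 8),
                4: Fraction(1, 8), 5: Fraction(1, 8), 6: Fraction(3, 8)}

def both_show(face, table):
    """Chance that two independent dice with this table both show `face`."""
    return table[face] * table[face]

# The two completions agree about double fives...
print(both_show(5, completion_a), both_show(5, completion_b))  # 1/64 1/64
# ...but disagree about double sixes.
print(both_show(6, completion_a), both_show(6, completion_b))  # 49/1600 9/64
```

Since the two completions give different answers for double sixes, the assumption about the number 5 cannot settle that question by itself.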

Formal definition

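In mathematical terms, a statistical model is usually defined as a pair with two parts. The first part is the sample space, which is the set of all possible results we might observe. The second part is a set of probability distributions on that sample space, one of which is assumed to describe, at least approximately, how the data arise.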
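In common textbook notation, the two parts can be written as below. The symbols are an assumption here, since the text above does not use any: S stands for the set of possible results, \mathcal{P} for the set of probability distributions, and \theta for the parameters that label each distribution.

```latex
(S, \mathcal{P}), \qquad \mathcal{P} = \{ F_\theta : \theta \in \Theta \}
```

For the two fair dice in the introduction, S would be the 36 ordered pairs of faces, and \mathcal{P} would contain a single distribution that gives each pair a probability of 1/36.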

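Sometimes, models get more detailed. In Bayesian statistics, the model is extended with a probability distribution over its parameters (its settings), called a prior. A model can also keep two sets of distributions apart: a smaller set that we use for inference, and a larger set that could really have produced the data. This separation helps us check that a method is robust, meaning that it still works well when our assumptions about the data are partly wrong.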
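As a rough illustration of the Bayesian idea, the sketch below puts probabilities on two possible settings of a die: fair, where a 5 comes up 1 time in 6, and weighted, where it comes up 1 time in 8. The starting weights and the observed rolls are made up, and the update rule is ordinary Bayes' theorem.

```python
# Two candidate values for the parameter p = chance of rolling a 5,
# each paired with a made-up starting (prior) probability.
prior = {"fair": (1 / 6, 0.5), "weighted": (1 / 8, 0.5)}

# Made-up observation: 3 fives in 30 rolls.
rolls, fives = 30, 3

def likelihood(p, n, k):
    """Probability of one particular sequence with k fives in n rolls."""
    return p ** k * (1 - p) ** (n - k)

# Bayes' theorem: posterior is proportional to prior times likelihood.
unnormalised = {name: weight * likelihood(p, rolls, fives)
                for name, (p, weight) in prior.items()}
total = sum(unnormalised.values())
posterior = {name: value / total for name, value in unnormalised.items()}

for name, prob in posterior.items():
    print(f"{name}: {prob:.3f}")
```

With these invented numbers the data shift some belief toward the weighted setting, because 3 fives in 30 rolls is closer to what a chance of 1 in 8 would produce on average.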

An example

Imagine we want to understand how the ages of children in a population relate to their heights. Suppose the children’s ages are spread out evenly. Height is related to age, but not exactly: knowing a child’s age tells us roughly how tall the child is likely to be, not the precise height. We can describe this relationship with a linear regression model.

In this model, we write an equation like height = b₀ + b₁ × age + ε. Here, b₀ is the intercept (the height the line predicts at age zero), b₁ is the slope, which tells us how much height changes on average for each extra year of age, and ε is the error term, which accounts for why some children of the same age are a little taller or shorter than the line predicts. This helps us make better guesses about children’s heights based on their ages.
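The regression equation can be tried out on simulated data. The sketch below is only an illustration: the "true" values for b₀, b₁ and the size of the error are made up, the error is assumed to be Gaussian, and the line is fitted with ordinary least squares, a standard choice that the text above does not name.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "true" values, chosen only for illustration
# (intercept in cm, slope in cm per year, error spread in cm).
TRUE_B0, TRUE_B1, ERROR_SD = 75.0, 6.0, 4.0

# Ages spread evenly between 2 and 12 years; heights follow
# height = b0 + b1 * age + error, with Gaussian error.
ages = [random.uniform(2, 12) for _ in range(500)]
heights = [TRUE_B0 + TRUE_B1 * a + random.gauss(0, ERROR_SD) for a in ages]

def fit_line(x, y):
    """Ordinary least squares estimates of (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

b0_hat, b1_hat = fit_line(ages, heights)
print(f"estimated intercept: {b0_hat:.1f}, estimated slope: {b1_hat:.2f}")
print(f"predicted height at age 8: {b0_hat + b1_hat * 8:.1f}")
```

The estimates should land close to the made-up true values, but not exactly on them, because the error term makes every simulated sample a little different.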

General remarks

A statistical model is a special kind of mathematical model. Unlike many other mathematical models, a statistical model is not deterministic: some of its variables do not have fixed values but follow probability distributions instead. Such variables are called stochastic. For example, in the model of children's heights, the error term (often written ε) is a stochastic variable. Without it, the model would be deterministic.

Statistical models are used for three main purposes: to make predictions, to extract information from data, and to describe the random structure behind the data. These purposes help scientists understand and work with data better.

Dimension of a model

A statistical model has a dimension, which tells us how many parameters (numbers) we need to pick out one particular distribution in the model. For example, if we think data comes from a bell-shaped curve (called a Gaussian distribution), we need two numbers: the center (mean) and how spread out it is (standard deviation). This means the dimension is 2.

Sometimes, a model needs more numbers. If we think data points follow a straight line with some Gaussian scatter, we need three numbers: where the line starts (intercept), how steep it is (slope), and how much the points scatter around the line (the variance of the scatter). Even though a straight line is a one-dimensional shape in geometry, the model describing it has a dimension of 3, because dimension here counts parameters.
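One way to make the idea of dimension concrete is to fit both models and count the numbers each one needs. The sketch below uses made-up data; the estimates are the usual sample mean, sample standard deviation and least squares line, which are assumptions of this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Gaussian model: two parameters pin down the whole distribution.
sample = [random.gauss(10.0, 2.0) for _ in range(1000)]  # made-up data
gaussian_params = {
    "mean": statistics.fmean(sample),
    "standard_deviation": statistics.pstdev(sample),
}
fitted = statistics.NormalDist(gaussian_params["mean"],
                               gaussian_params["standard_deviation"])
print("chance of a value below 12:", round(fitted.cdf(12.0), 3))

# Straight-line model with Gaussian scatter: three parameters.
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [1.0 + 0.5 * x + random.gauss(0, 0.3) for x in xs]  # made-up data
mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
line_params = {
    "intercept": intercept,
    "slope": slope,
    "variance": statistics.fmean([r ** 2 for r in residuals]),
}

print("Gaussian model dimension:", len(gaussian_params))    # 2
print("straight-line model dimension:", len(line_params))   # 3
```

Once the two Gaussian parameters are known, any probability under the fitted model can be computed, which is what it means for those two numbers to describe the distribution.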

Nested models

Not to be confused with Multilevel models.

Two statistical models are called nested when the first model can be turned into the second by placing constraints on the parameters of the first. For example, the set of all Gaussian distributions has, nested inside it, the set of Gaussian distributions whose mean is zero: we constrain the mean in the full set to be zero and get the zero-mean set.

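Another example is a quadratic model, which has a linear model nested inside it: we get the linear model by constraining the coefficient of the squared term to be zero. In both examples, the first model has a higher dimension than the second, but this is not always true. The set of Gaussian distributions with a positive mean is nested inside the set of all Gaussian distributions, yet both have dimension 2.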
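The Gaussian example of nesting can be explored numerically. The sketch below fits the full model (mean and variance free) and the nested model (mean fixed at zero) to the same made-up sample and compares how well each fits, using the log-likelihood as the measure of fit. Maximum likelihood fitting is an assumption of this illustration.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable
data = [random.gauss(0.4, 1.0) for _ in range(200)]  # made-up sample
n = len(data)

def gaussian_log_likelihood(values, mean, variance):
    """Log-likelihood of the values under a Gaussian with this mean and variance."""
    return (-0.5 * len(values) * math.log(2 * math.pi * variance)
            - sum((x - mean) ** 2 for x in values) / (2 * variance))

# Full model: mean and variance are both free (dimension 2).
full_mean = sum(data) / n
full_var = sum((x - full_mean) ** 2 for x in data) / n

# Nested model: the mean is constrained to zero (dimension 1).
nested_mean = 0.0
nested_var = sum(x ** 2 for x in data) / n

full_ll = gaussian_log_likelihood(data, full_mean, full_var)
nested_ll = gaussian_log_likelihood(data, nested_mean, nested_var)
print(f"full model log-likelihood:   {full_ll:.2f}")
print(f"nested model log-likelihood: {nested_ll:.2f}")
```

The full model can never fit worse than the nested one, because every distribution available to the nested model is also available to the full model.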