Categorical distribution

Story

A probability is assigned to each of a set of discrete unordered outcomes.

Example

A hen will peck at grain A with probability \(\theta_\mathrm{A}\), grain B with probability \(\theta_\mathrm{B}\), and grain C with probability \(\theta_\mathrm{C}\).

Parameters

The distribution is parametrized by the probabilities assigned to each outcome. We define \(\theta_y\) to be the probability assigned to outcome \(y\). The set of \(\theta_y\)’s constitutes the parameters, each of which must be nonnegative, and which are constrained by

\[\begin{align} \sum_y \theta_y = 1. \end{align}\]

Support

If we index the categories with sequential integers from 1 to N, the distribution is supported on the integers 1 to N, inclusive.

Probability mass function

\[\begin{align} f(y;\{\theta_y\}) = \theta_y \end{align}\]

Cumulative distribution function

If we choose to impose an order on the categories, we can define a CDF. Here, we denote the index by \(y\).

\[\begin{align} F(y;\{\theta_y\}) = \sum_{y'\le y} \theta_{y'}. \end{align}\]
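
As a minimal sketch of the PMF and CDF defined above, assuming the categories are indexed 1 to N in the imposed order and the probabilities are stored in a NumPy array, one might write the following. The names `theta`, `pmf`, and `cdf` and the numerical values are illustrative assumptions, not part of any package.

```python
import numpy as np

# Hypothetical probabilities for three ordered categories (indices 1, 2, 3)
theta = np.array([0.5, 0.3, 0.2])


def pmf(y, theta):
    """PMF at the 1-based category index y, which is simply theta_y."""
    return theta[y - 1]


def cdf(y, theta):
    """CDF at the 1-based category index y: the sum of theta_{y'} for y' <= y."""
    return np.cumsum(theta)[y - 1]
```

With these values, `cdf(2, theta)` adds the first two probabilities, and `cdf(3, theta)` recovers the normalization condition.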

Moments

Moments are not defined for a Categorical distribution because the value of \(y\) is not necessarily numeric.

Usage

Package	Syntax
NumPy	`rng.choice(len(theta), p=theta)`
SciPy	`scipy.stats.rv_discrete(values=(range(len(theta)), theta))`
Distributions.jl	`Categorical(theta)`
Stan	`categorical(theta)`

Notes

When using the scipy.stats module, this distribution must be constructed manually with scipy.stats.rv_discrete(). The categories need to be encoded by an index. For interactive plotting purposes, below, we need to specify a custom PMF and CDF.
To sample out of a Categorical distribution with NumPy, use rng.choice(), specifying the values of \(\theta\) using the p kwarg. Note that the NumPy and SciPy syntax in the table above encodes the categories with zero-based indices, 0 to \(N-1\).
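
The two notes above can be sketched together as follows, using the hen example. The probability values, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np
import scipy.stats

# Hypothetical probabilities for the hen example (grains A, B, C)
theta = np.array([0.5, 0.3, 0.2])
labels = np.array(["A", "B", "C"])

rng = np.random.default_rng(seed=3252)

# Draw zero-based category indices, then map them back to the labels
indices = rng.choice(len(theta), size=1000, p=theta)
samples = labels[indices]

# Manual construction in scipy.stats, with categories encoded as 0, 1, 2
dist = scipy.stats.rv_discrete(values=(np.arange(len(theta)), theta))
pmf_values = dist.pmf(np.arange(len(theta)))
cdf_values = dist.cdf(np.arange(len(theta)))
```

Sampling indices and mapping them to labels afterwards keeps the encoding explicit, which matters because the categories themselves are not numeric.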

PMF and CDF plots

In the plot below, there are four categories, but we can only specify the probability for three of the four categories because the probability of the fourth is set by the normalization condition. If the parameters are such that the sum of the probabilities of the three specified categories exceeds one, the PMF is not displayed.
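
The normalization logic described above might look like the following sketch. The function name `complete_theta` is hypothetical, and this is not the code behind the plot; it only illustrates how the fourth probability follows from the other three.

```python
import numpy as np


def complete_theta(theta_first_three):
    """Return all four probabilities, or None if the first three are invalid."""
    theta_first_three = np.asarray(theta_first_three, dtype=float)

    # The fourth probability is fixed by the normalization condition
    theta_4 = 1.0 - theta_first_three.sum()

    # Invalid if any specified probability is negative
    # or if the three specified probabilities sum to more than one
    if np.any(theta_first_three < 0) or theta_4 < 0:
        return None

    return np.append(theta_first_three, theta_4)
```

For example, `complete_theta([0.2, 0.3, 0.4])` fills in a fourth probability of about 0.1, while `complete_theta([0.5, 0.4, 0.3])` returns `None` because the three specified values already sum to more than one.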