Categorical distribution


Story

A probability is assigned to each of a set of discrete unordered outcomes.


Example

A hen will peck at grain A with probability \(\theta_\mathrm{A}\), grain B with probability \(\theta_\mathrm{B}\), and grain C with probability \(\theta_\mathrm{C}\).


Parameters

The distribution is parametrized by the probabilities assigned to each outcome. We define \(\theta_y\) to be the probability assigned to outcome \(y\). The \(\theta_y\)’s are the parameters; each satisfies \(0 \le \theta_y \le 1\), and together they are constrained by

\[\begin{align} \sum_y \theta_y = 1. \end{align}\]
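The normalization constraint can be checked numerically before a parameter set is used. The following is a minimal sketch, assuming the probabilities are stored in a NumPy array; the function name is_valid_theta and the tolerance atol are hypothetical choices made for illustration.

```python
import numpy as np


def is_valid_theta(theta, atol=1e-8):
    """Return True if theta is a valid set of Categorical parameters.

    `is_valid_theta` and `atol` are illustrative names, not part of any library.
    """
    theta = np.asarray(theta, dtype=float)
    # Each probability must lie in [0, 1] ...
    in_range = np.all((theta >= 0) & (theta <= 1))
    # ... and together they must sum to one (up to floating-point error).
    normalized = np.isclose(theta.sum(), 1.0, atol=atol)
    return bool(in_range and normalized)


print(is_valid_theta([0.2, 0.5, 0.3]))  # probabilities sum to one
print(is_valid_theta([0.2, 0.5, 0.5]))  # probabilities sum to more than one
```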

Support

If we index the categories with sequential integers from 1 to N, the support is the set of integers from 1 to N, inclusive.


Probability mass function

\[\begin{align} f(y;\{\theta_y\}) = \theta_y \end{align}\]
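Because the PMF simply returns the probability assigned to a category, evaluating it amounts to a lookup. The sketch below assumes categories are indexed from 1 to N as in the Support section, while NumPy arrays are indexed from 0; the labels, the probabilities, and the function name categorical_pmf are hypothetical.

```python
import numpy as np

labels = ["A", "B", "C"]           # hypothetical category labels
theta = np.array([0.2, 0.5, 0.3])  # hypothetical probabilities, in the same order


def categorical_pmf(y, theta):
    """PMF at the 1-based category index y; zero outside the support."""
    if isinstance(y, (int, np.integer)) and 1 <= y <= len(theta):
        # Shift by one because NumPy arrays are 0-indexed.
        return float(theta[y - 1])
    return 0.0


# Encode each label by its 1-based index, then look up its probability.
index_of = {label: i + 1 for i, label in enumerate(labels)}
p_B = categorical_pmf(index_of["B"], theta)
print(p_B)
```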

Cumulative distribution function

If we choose to impose an order on the categories, we can define a CDF. Here, we denote the index by \(y\).

\[\begin{align} F(y;\{\theta_y\}) = \sum_{y'\le y} \theta_{y'}. \end{align}\]
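Once an ordering is imposed, the CDF is a running sum of the probabilities. A minimal sketch with NumPy, assuming hypothetical probabilities and 1-based category indices (the function name categorical_cdf is also an assumption):

```python
import numpy as np

theta = np.array([0.2, 0.5, 0.3])  # hypothetical probabilities for categories 1, 2, 3

# Running sum: cdf[i] is the total probability of categories 1 through i + 1.
cdf = np.cumsum(theta)


def categorical_cdf(y, theta):
    """CDF at the 1-based category index y, for the imposed ordering."""
    # Sum the probabilities of all categories with index at most y.
    return float(np.sum(theta[:y]))


print(cdf)
print(categorical_cdf(2, theta))
```

The last entry of the cumulative sum should equal one, up to floating-point error, whenever the parameters satisfy the normalization condition.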

Moments

Moments are not defined for a Categorical distribution because the value of \(y\) is not necessarily numeric.


Usage

Package            Syntax
NumPy              rng.choice(len(theta), p=theta)
SciPy              scipy.stats.rv_discrete(values=(range(len(theta)), theta))
Distributions.jl   Categorical(theta)
Stan               categorical(theta)



Notes

  • When using the scipy.stats module, this distribution must be constructed manually with scipy.stats.rv_discrete(). The categories need to be encoded by an index. For the interactive plots below, we need to specify a custom PMF and CDF.

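  • To sample from a Categorical distribution with NumPy, use rng.choice(), where rng is a NumPy random generator (e.g., one created with np.random.default_rng()), specifying the values of \(\theta\) with the p keyword argument.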
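The two notes above can be illustrated together. This is a sketch under stated assumptions: the probabilities and the seed are hypothetical, and categories are encoded by the 0-based indices 0 to N-1 to match the syntax listed under Usage.

```python
import numpy as np
import scipy.stats

theta = np.array([0.2, 0.5, 0.3])        # hypothetical probabilities
rng = np.random.default_rng(seed=3252)   # arbitrary seed, for reproducibility

# NumPy: draw category indices 0, 1, 2 with probabilities theta.
samples = rng.choice(len(theta), size=10_000, p=theta)

# Empirical frequencies should be close to theta for a large sample.
freqs = np.bincount(samples, minlength=len(theta)) / len(samples)

# SciPy: construct the distribution manually from (index, probability) pairs.
dist = scipy.stats.rv_discrete(values=(range(len(theta)), theta))
pmf = dist.pmf(np.arange(len(theta)))
scipy_samples = dist.rvs(size=5, random_state=rng)

print(freqs)
print(pmf)
```

To recover labels from sampled indices, one option is to index an array of labels with the samples, e.g. np.array(["A", "B", "C"])[samples].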


PMF and CDF plots

In the plot below, there are four categories, but we can specify the probabilities for only three of them because the probability of the fourth is set by the normalization condition. If the three specified probabilities sum to more than one, the PMF is not displayed.
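The role of the normalization condition here can be sketched in a few lines. The function name complete_theta is hypothetical, and returning None for an invalid input is an assumption chosen to mirror the plot not being displayed; it is not a description of how the plot itself is implemented.

```python
import numpy as np


def complete_theta(theta_first):
    """Append the last probability implied by normalization.

    Returns None if the specified probabilities are negative or sum to
    more than one. `complete_theta` is an illustrative name only.
    """
    theta_first = np.asarray(theta_first, dtype=float)
    # The final probability is whatever remains after the specified ones.
    theta_last = 1.0 - theta_first.sum()
    if np.any(theta_first < 0) or theta_last < 0:
        return None
    return np.append(theta_first, theta_last)


print(complete_theta([0.2, 0.3, 0.4]))  # fourth probability is implied
print(complete_theta([0.5, 0.4, 0.3]))  # specified probabilities exceed one
```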