Data Formation and Random Variables
Data is often collected by sampling from a population of measurable numerical values. Each sample drawn from this population can be thought of as originating from a distribution that characterizes the dataset. In statistical terms, we represent each sample as a random variable, a cornerstone concept in probability theory and statistics. A random variable is a function that maps an outcome, such as the height of a person, to a real number on the number line: X: Ω → ℝ, where Ω is the set of possible outcomes. This mapping matters because it lets us apply mathematics to the outcomes and gain insight into the data. In particular, understanding random variables often involves calculus, using integration and differentiation to analyze probability density functions.
To empirically understand the distribution of human heights, we might sample from a reasonably large pool of candidates, say 30 people. We measure their heights, round each measurement to two decimal places, and plot the frequency of each height range on a histogram. In this histogram, the x-axis represents the height, and the y-axis represents the frequency of each height bin. A 'bin' refers to a specific range of heights; in this example, we choose a bin width of 10 cm.
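The sampling-and-binning procedure above can be sketched in a few lines of Python. This is a hypothetical illustration, not real data: it assumes human heights follow a Normal(170, 10) distribution in centimetres, draws 30 samples, and tallies them into 10 cm bins.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed model (not real data): heights in cm ~ Normal(mean=170, sd=10).
heights = [random.gauss(170, 10) for _ in range(30)]

# Assign each height to a 10 cm wide bin, e.g. 173.4 -> the 170-180 bin.
bin_width = 10
bins = Counter(int(h // bin_width) * bin_width for h in heights)

# Print a text histogram: bin range on the x-axis role, frequency as bars.
for lo in sorted(bins):
    print(f"{lo}-{lo + bin_width} cm: {'#' * bins[lo]}")
```

Each printed row corresponds to one bar of the histogram described above, with the bin range playing the role of the x-axis and the bar length the frequency.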
Identically Distributed Random Variables
In our example, each person in the sample pool serves as an individual random variable, and each random variable takes a value from a known distribution—namely, the distribution of human heights. Since all the sampled heights must conform to this known distribution, we say that the random variables are identically distributed.
Independence is a cornerstone concept in probability theory. It means that the outcome of one random variable provides no information about the outcome of another. Mathematically, this is expressed as
P(X1 = 1.8 m and X2 = 1.7 m) = P(X1 = 1.8 m) × P(X2 = 1.7 m). In essence, the joint probability of X1 and X2 taking specific heights is the product of their individual probabilities.
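This factorization is easy to verify exhaustively for a small discrete case. The sketch below uses two fair six-sided dice (a stand-in for the continuous height example) as independent random variables X1 and X2, and checks that the joint probability of a specific pair of outcomes equals the product of the marginals.

```python
from fractions import Fraction

# Two fair dice as independent random variables X1 and X2.
p_x1_is_3 = Fraction(1, 6)  # P(X1 = 3)
p_x2_is_5 = Fraction(1, 6)  # P(X2 = 5)

# Enumerate the full joint sample space (36 equally likely pairs)
# and compute the joint probability P(X1 = 3 and X2 = 5) directly.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
joint = Fraction(sum(1 for a, b in outcomes if a == 3 and b == 5),
                 len(outcomes))

# Independence: the joint probability factors into the marginals.
print(joint == p_x1_is_3 * p_x2_is_5)  # → True
```

Exact `Fraction` arithmetic avoids floating-point noise, so the equality check is literal rather than approximate.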
Importance of the i.i.d. Assumption
The i.i.d. assumption simplifies many statistical analyses. For instance, if we want the joint distribution of all these random variables (i.e., a multivariate distribution), we can simply multiply their individual distributions together: independence gives the factorization into a product, and identical distribution means every factor is the same.
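The product form of the joint density can be shown concretely. The sketch below assumes (as above, hypothetically) that heights are Normal(170, 10) in centimetres and computes the joint density of three i.i.d. observations as the product of identical marginal densities.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Assumed model (hypothetical): heights in cm ~ Normal(170, 10).
mu, sigma = 170.0, 10.0

# Three i.i.d. observations (made-up values for illustration).
sample = [168.0, 175.0, 162.0]

# i.i.d. => the joint density factors into a product of the SAME
# marginal density evaluated at each observation.
joint_density = math.prod(normal_pdf(x, mu, sigma) for x in sample)
print(joint_density)
```

This product of marginals is exactly the likelihood function used when fitting mu and sigma to observed data, which is why the i.i.d. assumption is so convenient in practice.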