Fundamental Concepts in Probability, Statistics, and Estimation Theory for Engineers 

Introduction

Engineering is a field that often grapples with uncertainty. Whether determining the probability of system failure or estimating the parameters of a dynamical system such as a self-driving vehicle, engineers routinely leverage concepts from probability and statistics. This article provides an overview of key concepts that are central to engineering applications. By the end of this article, the learner should understand the plot on the right, which shows a vehicle localization algorithm built on probabilistic models drawing on probability, random processes, and statistical inference.

The Foundation: Probability Models and Axioms

Probability models and axioms form the backbone of our understanding of uncertainty and random events. There are three fundamental axioms of probability:

1. Non-negativity: the probability of any event A satisfies P(A) >= 0.
2. Normalization: the probability of the entire sample space S is 1, i.e., P(S) = 1.
3. Additivity: for mutually exclusive events A and B, P(A or B) = P(A) + P(B); more generally, this holds for any countable collection of disjoint events.

These axioms are the starting point for constructing more complex probabilistic models.

Probability Space and Random Variables

A probability space, integral to modeling random events, consists of a sample space (the set of possible outcomes), a set of events, and a probability measure that assigns probabilities between 0 and 1 to these events. Engineers represent the outcomes of random phenomena using random variables. The distribution of a random variable, such as Uniform, Normal, Poisson, or Laplace, outlines the probabilities of diverse outcomes and can model various types of uncertainty. 

Uniform Distribution: The uniform distribution, as depicted in the first plot, describes a situation where each outcome in the sample space is equally likely to occur. It's characterized by two parameters: "a" and "b", which define the minimum and maximum values, respectively. The probability density function (PDF) for a continuous uniform distribution is given as "1/(b-a)" for "a <= x <= b". In the field of robotics, this could model an event where a robot is equally likely to be at any position within a specified range.

Normal Distribution: The normal (or Gaussian) distribution is a bell-shaped curve that is used to represent a set of data where the majority of the data points cluster around the mean, and the rest taper off symmetrically towards either end. It's defined by two parameters: the mean ("mu") and the standard deviation ("sigma"). The PDF is given as "1/sqrt(2*pi*sigma^2) * exp(-(x-mu)^2 / (2*sigma^2))". In robotics, sensor noise is often assumed to follow a normal distribution, reflecting the idea that the most likely measurement is the true value, with a symmetrical likelihood of error in either direction.

Laplace Distribution: The Laplace distribution, also known as the double exponential distribution, is similar to the normal distribution but has heavier tails. It is defined by two parameters: the mean ("mu") and the diversity ("b"). The PDF is given as "1/(2*b) * exp(-abs(x-mu)/b)". The Laplace distribution can model situations where extreme events are more likely than they would be under a normal distribution. In robotics, this might be used to model sudden and drastic changes in sensor readings, such as when a robot encounters an obstacle.

Poisson Distribution: The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space. It's defined by a single parameter ("lambda"), which is the average rate of occurrence. The probability mass function (PMF) is given as "(lambda^k * exp(-lambda)) / k!". In robotics, a Poisson distribution might model the arrival of signals to a sensor over a fixed period of time, where signals arrive independently and at a constant average rate.
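To make these four distributions concrete, here's a minimal sketch, assuming Python with NumPy and SciPy available; the parameter values are arbitrary illustrations, not values from the text.

    import numpy as np
    from scipy import stats

    x = 2.0  # point at which to evaluate each density

    uniform = stats.uniform(loc=0.0, scale=5.0)  # a = 0, b = 5, pdf = 1/(b-a) on [a, b]
    normal = stats.norm(loc=0.0, scale=1.0)      # mu = 0, sigma = 1
    laplace = stats.laplace(loc=0.0, scale=1.0)  # mu = 0, diversity b = 1
    poisson = stats.poisson(mu=3.0)              # rate lambda = 3 (discrete)

    print("Uniform pdf at x:  ", uniform.pdf(x))
    print("Normal pdf at x:   ", normal.pdf(x))
    print("Laplace pdf at x:  ", laplace.pdf(x))
    print("Poisson pmf at k=2:", poisson.pmf(2))

    # Drawing samples simulates, e.g., sensor noise or event counts.
    samples = normal.rvs(size=1000)
    print("Sample mean is close to mu:", samples.mean())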

Independence and Mutual Exclusivity

In the world of probability, events can be independent, mutually exclusive, or neither. Two events are independent if the occurrence of one does not affect the probability of the occurrence of the other. They are mutually exclusive if they cannot both occur at the same time. Let's discuss each in detail: 

Independence: Two events, A and B, are said to be independent if the occurrence of A does not affect the probability of the occurrence of B, and vice versa. In other words, knowing that A has occurred does not change our belief about the likelihood of B occurring. This is mathematically expressed as

P(A ∩ B) = P(A) * P(B)

where P(A ∩ B) represents the probability of both A and B occurring, and P(A) and P(B) represent the probabilities of A and B occurring, respectively. In the context of robotics, sensor readings might be treated as independent if we assume that the readings from different sensors do not influence each other.

Mutual Exclusivity: Two events, A and B, are mutually exclusive (or disjoint) if they cannot both occur at the same time. If A occurs, then B cannot occur, and vice versa. This is mathematically expressed as P(A ∩ B) = 0, which states that the probability of both A and B occurring is zero. For example, in a robot navigation task, the events "robot is at position X" and "robot is at position Y" would be mutually exclusive if positions X and Y are different.

Understanding the difference between these concepts is crucial as it can greatly affect how we calculate combined probabilities and model systems. A common mistake is to confuse independence with mutual exclusivity. While both concepts involve a degree of 'separation' between events, they imply very different things about the relationships between those events.


Joint, Marginal, and Conditional Probability Mass Functions

In probability theory, joint, marginal, and conditional probability mass functions (PMFs) are essential concepts for understanding the relationships between multiple random variables.

Joint Probability Mass Function

The joint probability mass function describes the probabilities of multiple random variables taking on specific values together. In the context of the article, it helps us understand the combined behavior of multiple variables, such as the simultaneous positions, orientations, or speeds of a robot in robotics applications. By knowing the joint probabilities, we can analyze the overall behavior and interactions of the system's variables. 

The joint probability mass function is denoted as:

P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

This represents the probability that a set of n random variables X_1, X_2, ..., X_n takes on specific values x_1, x_2, ..., x_n simultaneously.

Marginal Probability Mass Function 

The marginal probability mass function focuses on the probabilities of individual random variables within the joint distribution. It allows us to study the behavior of each variable independently, regardless of the others. In the article's context, this is crucial for estimating the state of a single variable, like a robot's position, while disregarding the other variables. Marginal probabilities are useful in decision-making processes that involve single-variable analysis. 

The marginal probability mass function for a specific random variable X_k is denoted as:

P(X_k = x_k)

This represents the probability that the random variable X_k takes on a specific value x_k, regardless of the values of the other random variables in the collection X_1, X_2, ..., X_n.

Conditional Probability Mass Function 

The conditional probability mass function represents the probability of one random variable given the value of another random variable. In the context of the article, it is particularly valuable in understanding the relationship and dependencies between different variables. For example, in robotics, it can be used to estimate a robot's position (hidden variable) given sensor measurements (observable variable) by using techniques like Hidden Markov Models (HMMs) or Kalman filters. Conditional probabilities help us make informed decisions based on observed data and understand how the variables interact with each other. 

The conditional probability mass function for two random variables X_k and X_j is denoted as:

P(X_k = x_k | X_j = x_j)

This represents the probability that the random variable X_k takes on a specific value x_k, given that the random variable X_j has taken on a specific value x_j. It allows us to study the relationship between the two random variables by conditioning one variable on the other.
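To make these three functions concrete, here's a small sketch, assuming Python with NumPy and a made-up joint PMF over two discrete random variables X and Y.

    import numpy as np

    # Hypothetical joint PMF P(X = x, Y = y): rows index values of X,
    # columns index values of Y; all entries sum to 1.
    joint = np.array([[0.10, 0.20],
                      [0.30, 0.40]])

    # Marginal PMFs: sum the joint PMF over the other variable.
    p_x = joint.sum(axis=1)  # P(X = x)
    p_y = joint.sum(axis=0)  # P(Y = y)

    # Conditional PMF: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y).
    p_x_given_y = joint / p_y  # divides each column by the matching P(Y = y)

    print("P(X):", p_x)                      # [0.3, 0.7]
    print("P(Y):", p_y)                      # [0.4, 0.6]
    print("P(X | Y=0):", p_x_given_y[:, 0])  # [0.25, 0.75]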

Usefulness for Estimation and Detection:

In summary, the concepts of joint, marginal, and conditional probability mass functions are fundamental in understanding and modeling the behavior of multiple random variables. In detection and estimation theory, they enable us to make informed decisions, estimate unknown parameters, and infer the state of systems with uncertainty. By using these probabilistic tools, engineers can design robust and efficient solutions for various real-world applications, including robotics, communication systems, image processing, and more.

Detection and Estimation Theory

This theory forms the bedrock of decision making in signal processing and data analysis. It includes two key components: detection, which decides whether a specific signal is present, and estimation, which discerns the values of the signal parameters. 

Terms:

Detection Theory:

The primary goal in detection theory is to decide between two or more hypotheses based on observed data. In the context of communication, for instance, this could involve deciding whether a received signal contains a specific message or just noise.

Key Concepts in Detection Theory:

- Hypotheses: the candidate explanations for the data, e.g., H0 ("noise only") versus H1 ("signal present").
- Likelihood ratio test: compare how likely the data are under each hypothesis and decide based on the ratio.
- Decision threshold: the value the test statistic must exceed before we declare the signal present.
- Error probabilities: the probability of false alarm (deciding H1 when H0 is true) and the probability of missed detection (deciding H0 when H1 is true).

Estimation Theory:

While detection is about deciding between discrete hypotheses, estimation focuses on determining the value of a continuous parameter or a vector of parameters based on observed data.

Key Concepts in Estimation Theory:

- Estimator: a rule that maps observed data to an estimate of the unknown parameter.
- Bias and variance: how far the estimator is from the true value on average, and how much it fluctuates across data sets.
- Mean-squared error (MSE): a combined measure of estimation quality.
- Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation: two widely used estimation principles, discussed later in this article.

The Process of Signal Detection and Estimation

In the context of signal processing and data analysis, detection is often the first step, where you decide whether a specific signal is present in the data. Once the presence of the signal is confirmed, the next step is to estimate the parameters of the signal.

Signal Detection

The first step is to detect whether a specific signal is present in the observed data. This typically involves comparing the data to a known pattern (or model) of the signal and deciding whether there is a match. The decision can be made based on a threshold or statistical test.

Parameter Estimation

Once the presence of the signal is confirmed, the next step is to estimate the parameters of the signal. This involves fitting the known pattern (or model) to the observed data in a way that minimizes the difference between the model and the data. Techniques like Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP) are commonly used for parameter estimation. 

Let's consider a simple example. Suppose you are monitoring a communication channel for a specific signal. The signal, if present, is known to follow a certain pattern (or model), but the exact parameters of the pattern (e.g., frequency, amplitude) are unknown.

The first step is to detect whether the signal is present. This typically involves comparing the observed data to the expected pattern and deciding whether there is a match. This decision could be based on a threshold or some statistical test.

Once the signal is detected, the next step is to estimate its parameters. This involves fitting the known pattern (or model) to the observed data in such a way that the difference between the model and the data is minimized. This could be done using various methods, such as least squares, maximum likelihood, or Bayesian methods, depending on the nature of the signal and the noise in the data.

So, in summary, detection and estimation are two key steps in signal processing and data analysis, and they often go hand in hand. The exact methods used for detection and estimation depend on the specifics of the problem, including the nature of the signal, the noise, and the available data.
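Here's a minimal sketch of the detect-then-estimate workflow, assuming Python with NumPy, a known signal shape, Gaussian noise, and an arbitrarily chosen threshold; none of these values come from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 200
    t = np.arange(n)
    template = np.sin(2 * np.pi * 0.05 * t)       # known signal pattern (unit amplitude)
    y = 1.5 * template + rng.normal(0.0, 1.0, n)  # observed data, true amplitude 1.5

    # Detection: correlate the data with the template and compare to a threshold.
    statistic = y @ template / np.sqrt(template @ template)  # ~N(0,1) under noise only
    threshold = 3.0  # chosen to keep the false-alarm probability small (assumption)
    detected = statistic > threshold
    print("Detection statistic:", statistic, "-> signal present?", detected)

    # Estimation: least-squares amplitude fit (also the MLE under Gaussian noise).
    if detected:
        a_hat = (y @ template) / (template @ template)
        print("Estimated amplitude:", a_hat)  # close to 1.5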

Linear Estimation Techniques

Linear estimation involves finding the best linear function that approximates the relationship between the data and the parameters we want to estimate. One common technique in linear estimation is the Linear Minimum Mean-Squared-Error Estimator (also known as the Wiener filter). This technique minimizes the mean-squared error between the true values and the estimates, which makes it optimal under certain conditions.

Mathematically, a linear relationship between data y and parameters x can be represented as:

y = ax + b

where a and b are constants.

However, a nonlinear relationship might look something like:

y = a*e^(b*x) or y = a/(1 + b*e^(c*x))

where a, b, and c are constants, and e is the base of the natural logarithm.

When such nonlinear relationships exist, linear estimation methods like the basic Kalman filter or linear regression are not adequate. We need to use nonlinear estimation techniques, such as Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP) estimation, or non-linear versions of the Kalman filter like the Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF). These techniques are designed to handle the complexities of nonlinear relationships and provide more accurate estimates.

Nonlinear Estimation Techniques and Reasons

When the relationship between the data and the parameters we want to estimate is non-linear, we often use non-linear estimation techniques such as Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

Both methods are commonly used in machine learning and statistics for parameter estimation.

Explanation of Nonlinear Relationships:


The idea here is to understand the difference between linear and nonlinear methods in the context of parameter estimation, and how optimization comes into play.

When we talk about linear methods, we mean that the parameters we are trying to estimate enter into the model in a linear way. For instance, in a simple linear regression model, the relationship between the predictor variables and the outcome is assumed to be linear, and the parameters (the regression coefficients) can be estimated using linear algebra (specifically, matrix operations).

In mathematical terms, if we denote our predictors by X, our outcome by y, and our parameters (coefficients) by β, the forward model is:

y = Xβ + ϵ (where ϵ is the noise term)

And the parameter β can be estimated as:

β_hat = (X^T X)^(−1) X^T y

β_hat = (X^T X)^(−1) X^T y represents the solution for β_hat in ordinary least squares (OLS) regression, where:

- X is the matrix of predictor values (the design matrix),
- y is the vector of observed outcomes, and
- β_hat is the vector of estimated coefficients.

This formula provides the estimated parameters that minimize the sum of the squared residuals, which are the differences between the observed and predicted values of the response variable. Note that the ordinary least squares (OLS) method can be sensitive to the noise term, ϵ.
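Here's a short sketch of this closed-form solution, assuming Python with NumPy and synthetic data; in practice np.linalg.lstsq is preferred for numerical stability.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic linear data: y = 2.0 * x + 1.0 + noise.
    x = np.linspace(0.0, 10.0, 50)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([x, np.ones_like(x)])

    # beta_hat = (X^T X)^(-1) X^T y, solved as a linear system.
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print("Estimated slope and intercept:", beta_hat)  # close to [2.0, 1.0]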

However, in many cases, the relationship between the predictors and the outcome is not linear, or the model itself may be nonlinear in the parameters. In such cases, we cannot use linear algebra to solve for the parameters directly, and instead must use an optimization approach.

In this case, we define a cost function (like the Mean Squared Error (MSE)) and use an optimization algorithm to find the parameters that minimize this cost. This is often done using iterative methods like gradient descent. The optimization problem can be written as:

θ_hat = argmin_θ MSE(y, f(X; θ))

Here, f(X; θ) is our nonlinear model, θ are the parameters of the model, and θ_hat is the estimated value of the parameters that minimizes the MSE. The argmin operation means "find the argument that minimizes"; in this context, it means "find the parameters that minimize the MSE".

So, in summary, whether we use linear algebra or optimization to estimate our parameters depends on whether our model and the relationship between our predictors and outcome is linear or nonlinear.
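To make the optimization view concrete, here's a minimal gradient-descent sketch, assuming Python with NumPy; the model y = a*e^(b*x), the synthetic data, and the learning rate are illustrative choices, not values from the text.

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic data from a model that is nonlinear in its parameters:
    # y = a * exp(b * x), with true values a = 2.0, b = 1.0.
    x = np.linspace(0.0, 1.0, 50)
    y = 2.0 * np.exp(1.0 * x) + rng.normal(0.0, 0.05, x.size)

    a, b = 1.0, 0.0  # initial guesses
    lr = 0.005       # learning rate, tuned by hand for this toy problem

    for _ in range(20000):
        f = a * np.exp(b * x)                              # model prediction
        r = f - y                                          # residuals
        grad_a = np.mean(2.0 * r * np.exp(b * x))          # d(MSE)/da
        grad_b = np.mean(2.0 * r * a * x * np.exp(b * x))  # d(MSE)/db
        a -= lr * grad_a
        b -= lr * grad_b

    print("Estimated a, b:", a, b)  # should approach (2.0, 1.0)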

Examples of Nonlinear Estimation Methods

In the next section, we'll illustrate the use of a Kalman filter to estimate a signal that has been corrupted by Gaussian noise.

The plot will present three separate lines: the true signal, the measured signal, and the estimated signal. 
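As a preview, here's a minimal sketch of such a filter, assuming Python with NumPy, a scalar random-walk state model, and illustrative noise variances; it produces exactly those three signals.

    import numpy as np

    rng = np.random.default_rng(3)

    n = 100
    true_signal = np.cumsum(rng.normal(0.0, 0.1, n))      # slowly drifting true signal
    measurements = true_signal + rng.normal(0.0, 1.0, n)  # noisy measurements

    Q, R = 0.1**2, 1.0**2  # process and measurement noise variances (assumed known)
    x_hat, P = 0.0, 1.0    # initial state estimate and its variance

    estimates = []
    for z in measurements:
        # Predict: random-walk model x_k = x_{k-1} + w, with w ~ N(0, Q).
        P = P + Q
        # Update: blend prediction and measurement using the Kalman gain.
        K = P / (P + R)
        x_hat = x_hat + K * (z - x_hat)
        P = (1 - K) * P
        estimates.append(x_hat)

    # true_signal, measurements, and estimates are the three lines to plot.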

In conclusion, detection and estimation theory are essential components of signal processing and data analysis. They enable us to make informed decisions based on observed data, whether it's detecting the presence of specific signals or estimating parameters in statistical models. These techniques find wide application in various fields, including robotics, communications, and image processing, providing valuable insights from noisy and uncertain data.

Bayesian Inference and Estimation Techniques

Bayes' theorem is a fundamental principle in probability theory and statistics that allows us to update the probability of a hypothesis based on new evidence. In Bayesian inference, we use Bayes' theorem to calculate the updated probability of a hypothesis (H) given some observed evidence (E): 

Bayes' Theorem: 

P(H|E) = P(E|H) * P(H) / P(E)

Where:

- P(H|E) is the posterior probability: our updated belief in the hypothesis after seeing the evidence.
- P(E|H) is the likelihood: the probability of observing the evidence if the hypothesis is true.
- P(H) is the prior probability: our belief in the hypothesis before seeing the evidence.
- P(E) is the probability of the evidence, which normalizes the result.

Bayesian estimation techniques, such as Maximum A Posteriori (MAP) estimation and Markov Chain Monte Carlo (MCMC) methods, utilize Bayes' theorem to update beliefs about unknown parameters or states as more data is gathered.

By incorporating prior knowledge and continuously updating estimates as new data arrives, Bayesian inference and estimation techniques are invaluable for various applications, including robotics, statistical modeling, machine learning, and data analysis.


Here's how to interpret the plot:

f(x | mu, sigma^2) = 1 / sqrt(2 * pi * sigma^2) * exp( - (x - mu)^2 / (2 * sigma^2) )

where:

- x is the location at which the density is evaluated,
- mu is the mean of the distribution (the most likely location), and
- sigma^2 is the variance (how uncertain we are about the location).

In this example, the prior and the sensor reading are not very close to each other, indicating that the previous location reading and the new sensor reading don't agree perfectly. Despite this disagreement, the Bayesian inference process still enables us to update our belief in a principled way. As we receive more sensor readings, we can continually update the prior and converge on a more accurate estimate of the robot's location.
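Here's a minimal sketch of a single such update, assuming Python and the standard result that a Gaussian prior combined with a Gaussian measurement yields a Gaussian posterior; gaussian_update is our own helper and the numbers are illustrative.

    # Fuse a Gaussian prior N(mu_p, var_p) with a Gaussian measurement N(mu_z, var_z):
    #   var_post = 1 / (1/var_p + 1/var_z)
    #   mu_post  = var_post * (mu_p/var_p + mu_z/var_z)
    def gaussian_update(mu_p, var_p, mu_z, var_z):
        var_post = 1.0 / (1.0 / var_p + 1.0 / var_z)
        mu_post = var_post * (mu_p / var_p + mu_z / var_z)
        return mu_post, var_post

    # Prior belief about the robot's location vs. a new sensor reading.
    mu, var = gaussian_update(mu_p=10.0, var_p=4.0, mu_z=13.0, var_z=1.0)
    print(mu, var)  # (12.4, 0.8): the posterior sits closer to the more certain reading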

Graphical Models

Bayesian networks, Markov random fields, and Hidden Markov Models are all types of graphical models, used to represent a set of variables and their conditional dependencies. Bayesian networks and Hidden Markov Models represent these dependencies with a directed acyclic graph (DAG), while Markov random fields use an undirected graph. For a set of random variables, a graphical model attempts to capture certain aspects of their joint distribution.

Directed Graphs

A directed graph (also called a "digraph") is a set of vertices and a set of directed edges. Each directed edge has an initial vertex, called its tail, and a terminal vertex, called its head. The directed edge is said to point or go from the tail to the head.

A simple example of a directed graph with three nodes could be represented as follows:

 A → B → C

In this graph, we have an edge from A to B and an edge from B to C. This implies that B depends on A, and C depends on B. There is no direct dependence of C on A, and no dependence in the reverse direction (of B on C, or of A on B).

Undirected Graphs

An undirected graph is a graph in which edges have no orientation. The edge (x, y) is identical to the edge (y, x); i.e., edges are not ordered pairs but sets {x, y} of vertices.

A simple example of an undirected graph with three nodes could be represented as follows:

 A ─ B ─ C

In this graph, we have an edge between A and B and an edge between B and C. Unlike the directed graph, this implies a mutual relationship between A and B and between B and C. This could be interpreted as B being related to A, and C being related to B, but it doesn't imply a direction of influence or dependence as in the directed graph.

In the context of graphical models, directed graphs are often used for causal relationships, as in Bayesian networks, where the direction of the arrow can be interpreted as a cause-and-effect relationship. Undirected graphs are often used to represent relationships of mutual influence, as in Markov networks, where the absence of arrows indicates a bidirectional relationship.

Markov Conditions

In a graphical model, the Markov conditions are a set of conditions that a probability distribution must satisfy in order to be Markov with respect to a graph. These conditions are used to express the notion of conditional independence.

The three Markov conditions (stated here for undirected graphs) are:

- Pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables.
- Local Markov property: a variable is conditionally independent of all other variables given its neighbors.
- Global Markov property: any two sets of variables are conditionally independent given a separating set of variables.

The Markov conditions define a Markov Random Field in the undirected graph and a Bayesian Network in the directed graph.

Bayesian Networks (Bayes Net)

A Bayesian network (also known as a Bayes net) is a probabilistic graphical model that uses a directed acyclic graph (DAG) to represent a set of variables and their conditional dependencies. Each node in the graph represents a random variable, while the edges between the nodes represent the conditional dependencies.

Bayesian networks are great for demonstrating concepts of dependence and conditional independence. In a Bayesian network, if there is no path from node A to node B, the two nodes are conditionally independent.

Here's an example of a simple Bayes Net:

       Rain
      /    \
 Umbrella   Wet Road

In this network, the state of the Umbrella and Wet Road variables each depends on the Rain variable, and Umbrella and Wet Road are conditionally independent given Rain.

Let's expand on the concepts of dependence and conditional independence, which are essential to understanding and interpreting graphical models:

Conditional Probability

In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event has already occurred. If the event of interest is A and event B is known or assumed to have occurred, the conditional probability of A given B is usually written as P(A|B).

In the context of graphical models, if there is a directed edge from variable X to variable Y, we interpret this as "the probability of Y depends on X". This can be written as P(Y|X).

Independence

In probability theory, two events are independent if the occurrence of one does not affect the probability of occurrence of the other (equivalently, does not affect the odds). Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

In graphical models, if there is no edge between two variables, we say that they are independent.

Conditional Independence

Two events A and B are conditionally independent given another event C if the occurrence of A and the occurrence of B are independent events in their conditional probability distribution given C.

In terms of graphical models, if we know the state of a variable, it can render some variables independent. For example, in a Bayesian network, if we know the state of a variable X, and X has children nodes Y and Z, then Y and Z are conditionally independent given X.

Let's illustrate this with an example Bayesian network:

      A
     / \
    B   C

Here, B and C are dependent variables because they both depend on A. However, given A, B and C become conditionally independent. This is because, once A is known, it 'blocks' the path between B and C, rendering them independent. Mathematically, this can be expressed as

P(B,C|A) = P(B|A) * P(C|A)

which says that the joint probability of B (umbrella is used) and C (road is wet) given A (it rained) is equal to the product of the individual probabilities of B and C given A.


This can be interpreted as follows: the probability that the umbrella is used and the road is wet, given that it rained, is equal to the product of the probability that the umbrella is used given that it rained and the probability that the road is wet given that it rained.


In other words, if you know that it rained, the fact that the road is wet doesn't provide any additional information about whether an umbrella was used, and vice versa. Therefore, the two events (umbrella is used, road is wet) are conditionally independent given the event that it rained.
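Here's a small sketch that checks this numerically by enumeration, assuming Python and made-up probabilities for the Rain → {Umbrella, Wet Road} network.

    # Hypothetical probabilities for the Rain -> {Umbrella, Wet Road} network.
    p_rain = 0.3
    p_umbrella_given = {True: 0.8, False: 0.1}  # P(Umbrella | Rain)
    p_wet_given = {True: 0.9, False: 0.2}       # P(Wet Road | Rain)

    # The network structure factorizes the joint as P(A) * P(B|A) * P(C|A).
    def joint(rain, umbrella, wet):
        pa = p_rain if rain else 1 - p_rain
        pb = p_umbrella_given[rain] if umbrella else 1 - p_umbrella_given[rain]
        pc = p_wet_given[rain] if wet else 1 - p_wet_given[rain]
        return pa * pb * pc

    # Check conditional independence given A = "it rained":
    p_bc_given_a = joint(True, True, True) / p_rain  # P(B, C | A)
    p_b_given_a = p_umbrella_given[True]             # P(B | A)
    p_c_given_a = p_wet_given[True]                  # P(C | A)
    print(p_bc_given_a, p_b_given_a * p_c_given_a)   # both 0.72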

Markov Random Fields (Undirected Graph)

A Markov Random Field (MRF), also known as a Markov network, is a model over an undirected graph. A key property of MRFs is that they obey the Markov property: the value of a particular node is conditionally independent of all other nodes, given its neighbors in the graph.

Example of a Markov Random Field

Markov Random Fields (MRFs) are typically used in computer vision and image processing. Let's consider an example where we use an MRF to model the pixels of a black and white image.

In this case, each pixel is a random variable that can take on two values: 1 (white) or 0 (black). We can create a graph where each node represents a pixel, and nodes are connected if they are neighboring pixels.

The idea is that the color of a pixel is likely to be similar to the color of its neighbors. In terms of the MRF, this means that a pixel is conditionally independent of all other pixels, given the values of its neighbors.

The nodes "Pixel 1", "Pixel 2", "Pixel 3", and "Pixel 4" represent random variables (in this case, image pixels), and the undirected edges between them represent the relationships between neighboring pixels. In this model, the value of each pixel is likely to be similar to the values of its neighboring pixels. For instance, there is an edge between "Pixel 1" and "Pixel 2", indicating that these two pixels are neighbors and therefore their values are likely to be similar. 


Hidden Markov Models (Directed Graph)

A Hidden Markov Model (HMM) is a statistical model where the system being modeled is a Markov process with unobserved (hidden) states. HMMs are mainly used in the field of temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.

Example of a Hidden Markov Model

Hidden Markov Models (HMMs) are often used in time series analysis and in applications like speech recognition, natural language processing, and bioinformatics.

Consider a simple weather prediction model as an example. Suppose we are trying to predict whether it will be rainy or sunny tomorrow, but we cannot directly observe the weather. Instead, we can observe whether a person carries an umbrella or not.

In this case, the true weather condition (rainy or sunny) is the hidden state, and whether or not the person carries an umbrella is the observable state. We can model the weather condition as a Markov process, meaning that tomorrow's weather depends only on today's weather, and not on the weather of previous days.

The likelihood of the person carrying an umbrella depends on the weather condition. For example, the person is more likely to carry an umbrella if it is rainy.

The graph above represents a Hidden Markov Model (HMM) with both hidden states ("Sunny" and "Rainy") and observed states ("Umbrella" and "No Umbrella").

The top layer represents the hidden states and the bottom layer represents the observed states. 

The directed edges represent the transitions between states. For example, the edge from "Sunny" to "Rainy" with a weight of 0.2 represents the probability of transitioning from a sunny day to a rainy day.

The edges from the hidden states to the observed states represent the relationship between the weather conditions and the likelihood of observing someone with or without an umbrella. For instance, the edge from "Rainy" to "Umbrella" with a weight of 0.8 signifies that there is an 80% chance of observing someone with an umbrella given that the day is rainy.
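Here's a minimal sketch of this HMM, assuming Python with NumPy. The 0.2 Sunny→Rainy and 0.8 Rainy→Umbrella entries come from the text above; every other probability is an assumption made to complete the matrices. A pass of the forward algorithm turns a sequence of umbrella observations into a belief over the hidden weather.

    import numpy as np

    states = ["Sunny", "Rainy"]

    # Transition matrix: row = today's hidden state, column = tomorrow's.
    A = np.array([[0.8, 0.2],   # Sunny -> {Sunny, Rainy}; 0.2 is from the text
                  [0.4, 0.6]])  # Rainy -> {Sunny, Rainy}; assumed values

    # Emission matrix: row = hidden state, column = {Umbrella, No Umbrella}.
    B = np.array([[0.1, 0.9],   # Sunny; assumed values
                  [0.8, 0.2]])  # Rainy; 0.8 is from the text

    pi = np.array([0.5, 0.5])   # initial belief over the hidden state (assumed)

    # Forward algorithm: belief over the hidden state given observations so far.
    observations = [0, 0, 1]    # Umbrella, Umbrella, No Umbrella
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]
    belief = alpha / alpha.sum()
    print(dict(zip(states, belief)))  # posterior over Sunny/Rainy on day 3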

Graphical models, such as Bayesian networks, Markov random fields, and Hidden Markov Models, are mathematical structures used to model the dependencies among a set of random variables. They are essential tools in machine learning and are used in many engineering applications, including speech recognition, image processing, and bioinformatics.

Random Processes

A random process, also known as a stochastic process, is a collection of random variables indexed by time or space. The random variables represent the outcomes of the process at different points in time or space. For example, the daily temperature in a location, the stock market prices over time, or the signal received by a sensor are all examples of random processes.

For a given random process, the distribution of outcomes can change over time or space. This is because the underlying conditions that govern the process can change. For instance, in a weather system, the distribution of possible temperatures can change from day to night or from season to season. Similarly, in a stock market, the distribution of price changes can vary based on market conditions, news events, etc.

A random process is said to be stationary if its statistical properties do not change over time. This means that the distribution of outcomes is the same at all points in time. For example, if the daily temperature is equally likely to be any value between 0 and 100 degrees every day of the year, then this would be a stationary process. However, most real-world processes are not stationary because the conditions that drive the process change over time.

There are also processes that are wide-sense stationary (WSS) or weakly stationary, which means that the mean and variance of the process do not change over time, and the autocorrelation between two time points depends only on the difference between the time points, not the actual time at which the correlation is computed. This is a weaker condition than strict stationarity.

On the other hand, ergodic processes are those where the time averages for a single realization converge to the ensemble averages. In other words, given a single, infinitely long realization of the process, we could compute the process's statistical properties accurately. 

Note: when the time averages and ensemble averages are the same, the process behaves the same way on average whether we look at one realization over time or many realizations at one moment. This could be interpreted as a kind of repeatability: the process repeats its overall behavior in each realization. This property allows us to make inferences about the overall behavior of the process based on a single, sufficiently long realization.

Let's examine the details:

The concept of ergodicity connects these two types of averages. A random process is said to be ergodic if its time averages are equal to its ensemble averages for all times and all points in space. This means that the long-term behavior of a single realization (time average) is representative of the behavior of the ensemble of realizations (ensemble average). This property is extremely useful because it allows us to make inferences about the overall process based on a single, sufficiently long realization.

Below is a depiction of a random walk, comparing the time average with the ensemble average:

The process demonstrated above is called a "Random Walk". It's a simple random process where each step is independent and equally likely to go up (1) or down (-1).

Regarding ergodicity, a simple symmetric random walk (like the one above) is indeed ergodic. The ergodic theorem for random walks states that, given a long enough walk, the average value of the walk (time average) will converge to the expected value (ensemble average). In our example, both the time average and the ensemble average are equal to zero, so the ergodic property holds.

This property is fundamentally important because it allows us to substitute time averages for ensemble averages. In other words, we can observe a single realization of the process for a long time to understand the behavior of the process as a whole. This is crucial in many practical situations where we only have one realization of a process (for example, a single run of an experiment or a single data set) and want to make inferences about the underlying random process.
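Here's a minimal sketch of that comparison, assuming Python with NumPy; the cleanly ergodic object here is the i.i.d. step process, so we compare the time average of the steps in one long realization against the ensemble average of steps across many realizations.

    import numpy as np

    rng = np.random.default_rng(4)

    n_steps, n_realizations = 10_000, 1_000
    steps = rng.choice([-1, 1], size=(n_realizations, n_steps))  # one walk per row

    time_average = steps[0].mean()         # average over time, one realization
    ensemble_average = steps[:, 0].mean()  # average across realizations, one instant

    print("Time average:    ", time_average)      # close to 0
    print("Ensemble average:", ensemble_average)  # close to 0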

In summary, the distribution associated with a random process can change over time because the underlying conditions of the process can change. Whether the distribution changes, and how it changes, depends on the specifics of the process and the system it represents. In any case, it's the randomness and the change over time that make a random process interesting and challenging to analyze and predict. 

Random Variables and Random Vectors in the Context of Random Processes 

A random variable is generally defined as a variable whose possible values are outcomes of a random phenomenon. It is a function that maps the outcomes of a random experiment to numerical values. For example, in a single roll of a six-sided die, the outcome (which number is face-up) is a random variable. This random variable only "takes a value" when the die is rolled.

In the context of a random process (also known as a stochastic process), you can think of a different random variable being defined at each point in time. For example, in the context of weather forecasting, the temperature at 2 pm is a different random variable than the temperature at 3 pm, even though they may be related.

In a continuous random process, you could think of an uncountably infinite number of random variables: one for every instant in time. For a discrete random process, there might be a countably infinite number of random variables: one for each discrete time step.

In the context of a random process, a random vector can be used to capture the state of the process at multiple points in time or space. In short, it is simply a collection of multiple random variables: X = [X_1, X_2, ..., X_n] is a random vector, or simply put, a vector of random variables. For instance, the samples of an audio signal stored in a fixed-size buffer [X_0, X_1, X_2, ..., X_m] form a random vector consisting of the random variables X(n) = X(nT), where n is the sample index and T is the sampling interval.

It can represent the states of a process at different time points. For example, in observing daily temperatures, the temperatures over a week would form a random vector. Each entry in the vector corresponds to the temperature (a random variable) on a different day.

A random vector is particularly useful when the random variables it contains are not independent. In this case, the joint distribution of the variables in the vector can capture correlations between the states of the process at different times or places. This can be valuable information for understanding and predicting the process. 

Random variable: Temperature on day 101: [38.83] °C

Random vector: Temperatures on week 21: [ 5.09 24.39 21.67 26.35 43.83 29.44 10.87] °C


The plot above represents a random process: daily temperatures over a year. This is a simulated process where temperature values are generated from a normal distribution, reflecting the daily variation in temperature.

A random variable in this context could be the temperature on a specific day. For instance, the temperature on the 101st day is 38.83 °C. It's a random variable because it's an outcome of a random process (daily temperature variations).

A random vector could be the temperatures over a specific week. For instance, the temperatures on the 21st week of the year are [5.09, 24.39, 21.67, 26.35, 43.83, 29.44, 10.87] °C. Each entry in this vector is the temperature on a different day, so the vector gives us a week's worth of temperature data.

Convergence of a Sequence of Random Variables

In many applications, we are interested in the behavior of a sequence of random variables as it progresses. This leads to the concept of convergence of a sequence of random variables. There are several types of convergence, including almost sure convergence, convergence in probability, and convergence in distribution.

This plot shows an example of convergence in probability. Each point on the plot represents the mean of a random variable drawn from a uniform distribution that ranges from -1/n to 1/n. As n increases, the range of the distribution narrows, and the mean of the distribution converges to 0. This is a demonstration of the law of large numbers, which states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed. 

Jointly Gaussian Random Variables and Vectors

Jointly Gaussian (or multivariate Gaussian) random variables and vectors are an important class of random processes. A set of random variables is jointly Gaussian if any linear combination of them follows a Gaussian distribution. Gaussian processes are widely used in engineering, especially in signal processing and machine learning, due to their mathematical tractability and physical relevance.

The scatter plot shows an example of jointly Gaussian random variables. The equation on the plot describes the properties of the joint Gaussian distribution.


The red ellipse on the plot is a confidence ellipse that contains about 99.7% of the samples, assuming the samples are jointly Gaussian. It provides a visual representation of the covariance matrix: the orientation of the ellipse indicates the correlation between the variables, and the lengths of the axes of the ellipse are proportional to the standard deviations of the variables. 

Random Walks and Brownian Motion

Random walks and Brownian motion are specific types of random processes that are particularly important in fields such as finance, physics, and engineering. A random walk is a sequence of random variables where each step is determined by a random draw from a certain distribution. Brownian motion is a continuous-time random walk, often used to model the motion of particles suspended in a fluid or stochastic market processes.

This plot shows an example of a random walk. At each step, the position changes by +1 or -1 with equal probability. Over time, the position undergoes a "drunkard's walk" pattern, seemingly wandering aimlessly up and down the number line. Despite its simplicity, the random walk model has wide-ranging applications, from modeling stock prices in finance to modeling the motion of gas molecules in physics.

Stationarity, Wide-Sense Stationarity, and Ergodicity

A random process is said to be stationary if its statistical properties do not change over time. This means that the mean, variance, and autocorrelation function of the process are time-invariant. A weaker form of stationarity is wide-sense stationarity (WSS), which requires only that the process's mean and autocorrelation function be time-invariant. A process is ergodic if its time-averaged behavior is the same as its ensemble-averaged behavior, meaning we can learn about the process's long-term behavior by observing a single realization over a long time period.

Let's demonstrate these properties with a simple example: a sine wave with added Gaussian noise.

A sine wave is deterministic, so by itself it's not a random process and doesn't have the properties of stationarity or ergodicity. If we add Gaussian noise to the sine wave, we get a random process. Strictly speaking, a sine wave with a fixed phase plus noise is not stationary, and not even WSS, because its mean moves up and down with the sine wave. The standard fix is to model the sine wave's phase as a random variable, uniformly distributed over one period: the mean and autocorrelation function then no longer depend on time, and the process is WSS. Under that random-phase model the process is also ergodic in the mean, because the time average of a single long realization matches the ensemble average.

We'll generate a plot of the process, and compute the time average and ensemble average to show that they're the same, demonstrating ergodicity.

The plot above shows a random process that is a sine wave with added Gaussian noise. Under the random-phase model, the process is wide-sense stationary: its mean (shown by the red dashed line) and autocorrelation function (not shown) do not change over time. Any single realization still swings up and down with the sine wave; it is the averaging over the random phase that makes the ensemble statistics time-invariant.

The process is ergodic because the time average (red dashed line) is the same as the ensemble average (green dashed line), meaning we can compute the process's statistical properties from a single, infinitely long realization of the process. In this case, the time average and ensemble average are approximately the same, which is what we expect for an ergodic process.
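Here's a minimal sketch of that check under the random-phase model, assuming Python with NumPy and illustrative frequency and noise values.

    import numpy as np

    rng = np.random.default_rng(5)

    n_time, n_realizations = 5_000, 2_000
    t = np.arange(n_time) * 0.01  # 50 seconds sampled at 100 Hz

    # Random-phase sine plus Gaussian noise: one phase per realization.
    phase = rng.uniform(0.0, 2 * np.pi, size=(n_realizations, 1))
    x = np.sin(2 * np.pi * 0.5 * t + phase) + rng.normal(0.0, 0.5, (n_realizations, n_time))

    time_average = x[0].mean()           # one realization averaged over time
    ensemble_average = x[:, 100].mean()  # all realizations averaged at one instant

    print(time_average, ensemble_average)  # both close to 0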

Real-World Random Processes

1. Shot Noise as Poisson Process

The first plot above shows a realization of a Poisson process, where the x-axis represents time and the y-axis represents the number of events that have occurred by each time point. The Poisson process is characterized by a parameter λ, which represents the average rate of events per unit of time. In this simulation, we used λ = 10, meaning that on average, 10 events occur per unit of time; the time-averaged event rate of this Poisson process is therefore 10.

The second plot shows shot noise derived from the Poisson process. In this model, we assume that each "event" in the Poisson process corresponds to the passage of an electron across a barrier, which generates a noise pulse with a certain amplitude.

Shot noise arises in electronics and telecommunications due to the discrete nature of electric charge. It's called "shot noise" because the charge carriers (like electrons) that carry current across a barrier or junction do so in a manner that can be thought of as a stream of "shots" being fired.

The Poisson process is a simple and commonly used statistical model for systems where events happen sporadically and independently of each other over time. This makes it a good model for shot noise, where each "shot" (or the passage of an electron across a barrier) is an independent event.

2. Thermal Noise as Gaussian Process

Thermal noise, also known as Johnson-Nyquist noise, is the electronic noise generated by the thermal agitation of the charge carriers (usually the electrons) inside an electrical conductor at equilibrium, which happens regardless of any applied voltage. The random motion of electrons due to thermal agitation results in a net fluctuation in the voltage or current, which is what we refer to as thermal noise. 

Thermal noise is modeled as a Gaussian process because the movement of each electron can be considered independent of the others, and the net result of a large number of independent, random movements is a Gaussian distribution due to the central limit theorem.

The power spectral density of thermal noise is constant and is given by

N0 = kT

where k is Boltzmann's constant and T is the absolute temperature in Kelvin. This means that the power of the thermal noise is the same at all frequencies, which is why it's often referred to as "white noise."
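As a quick worked example, here's the thermal noise floor at a reference temperature; T = 290 K is our assumption, not a value from the text.

    import math

    k = 1.380649e-23  # Boltzmann's constant, J/K
    T = 290.0         # a common reference temperature, in Kelvin (assumption)

    N0 = k * T        # thermal noise power spectral density, W/Hz
    print(N0)         # ~4.0e-21 W/Hz

    # In dBm/Hz: 10*log10(N0 / 1 mW) is about -174 dBm/Hz, the familiar
    # thermal noise floor used in communications link budgets.
    print(10 * math.log10(N0 / 1e-3))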


3. Pink Noise as Fractional Gaussian Process

Pink noise, also known as 1/f noise, is a signal or process that has a frequency spectrum such that the power spectral density (power per frequency interval) is inversely proportional to the frequency of the signal. In pink noise, each octave carries an equal amount of noise power. The name arises from the pink appearance of visible light with this spectral distribution. In terms of random processes, pink noise is a type of fractional Gaussian noise. This means it can be modeled as a Gaussian process, but with a correlation structure that leads to its 1/f power spectrum.

Here are some key properties of pink noise:

- Its power spectral density falls off as 1/f, so lower frequencies carry more power than higher ones.
- Each octave (a doubling of frequency) carries an equal amount of noise power.
- It exhibits long-range dependence: samples far apart in time remain correlated, unlike white noise.

In summary, pink noise is a common type of noise found in many natural and man-made systems. Its 1/f power spectrum, long-range dependence, and other properties make it a unique and interesting topic of study in fields like physics, engineering, and data analysis. It is also used in audio and acoustics to create balanced noise signals or to test audio equipment. 

4. Brownian Motion / Wiener Process

The plots above display a realization of Gaussian noise and the corresponding Brownian motion derived from that noise. In the Brownian motion plot in the middle, the position at a particular time is the cumulative sum of the Gaussian noise up to that point.

Brownian motion is a type of random walk where each step is a Gaussian random variable. It's used to model a variety of phenomena, including the motion of particles in a fluid, stock prices, and more. We've already seen an example of a random walk. 

You can actually think of Brownian motion as the integral of Gaussian white noise. Each tiny step the particle takes in a Brownian motion process is a sample of Gaussian noise, and the position of the particle at any point in time is the sum (or integral) of all these steps. 
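In code, this view is nearly a one-liner (a sketch, assuming Python with NumPy): the Brownian path is the cumulative sum of Gaussian steps.

    import numpy as np

    rng = np.random.default_rng(6)

    dt = 0.01
    noise = rng.normal(0.0, np.sqrt(dt), size=10_000)  # Gaussian white-noise steps
    brownian = np.cumsum(noise)                        # position = sum of all steps so far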

Random Processes Through Linear Systems

Random processes can be passed through linear systems, resulting in new random processes. The output process depends on the input process and the characteristics of the system. This concept is fundamental in fields like signal processing, where we often want to understand the behavior of a random signal (like noise) as it passes through a filter.

This plot shows an example of a random process (a sequence of random values, in this case, from a normal distribution) passing through a linear system. The linear system in this case is a moving average filter, which smooths the signal by replacing each point with the average of the surrounding points. The filtered signal, shown in orange, has less high-frequency content and appears smoother than the original signal. This illustrates how a linear system can alter the characteristics of a random process. In real-world applications, this could represent noise reduction or signal conditioning in a signal processing system. 
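Here's a minimal sketch of this setup, assuming Python with NumPy and an illustrative window length.

    import numpy as np

    rng = np.random.default_rng(7)

    signal = rng.normal(0.0, 1.0, 500)  # white Gaussian input process
    window = 10
    kernel = np.ones(window) / window   # moving-average impulse response
    filtered = np.convolve(signal, kernel, mode="same")  # output process

    # The output variance drops by roughly a factor of the window length,
    # reflecting the removal of high-frequency content.
    print(signal.var(), filtered.var())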

Probabilistic Methods in State Estimation: From Theory to Real-World Robotics Applications 

Introduction

The field of Robotics heavily relies on probabilistic algorithms for various functions, from perception and decision making to control and interaction. As robots interact with an uncertain and often unpredictable world, accurately estimating their state (position, orientation, speed, etc.) and the state of their surroundings becomes vital for efficient task execution.

Background Concepts: Random Processes, Nonlinear State Estimation, and Probabilistic Methods

In the application of probabilistic algorithms to robotics, several key concepts form the backbone of our understanding and approach.

Random Processes

Random processes are fundamental to understanding and modeling the inherent uncertainty that exists in real-world robotics applications. This uncertainty stems from various sources, such as sensor noise, actuator errors, model inaccuracies, or an unpredictable environment. Recognizing and accounting for these random processes is vital for creating robust and reliable robotic systems.

Nonlinear State Estimation Methods

In many practical scenarios, the dynamics of a system are not linear. For instance, the movement of a robot often depends on its current position, orientation, speed, and control inputs, all of which can relate in nonlinear ways. Similarly, sensor readings can be nonlinear functions of the state of the robot and the environment.

When the system dynamics are nonlinear, linear methods like the basic Kalman filter cannot provide accurate state estimates. This necessitates the use of nonlinear methods such as the Extended Kalman Filter or the Particle Filter. These methods can handle the nonlinearities and provide more accurate state estimates, making them invaluable tools in robotics.

Probabilistic Methods

Probabilistic methods are employed to handle the inherent uncertainty in robotics. These methods use the principles of probability theory to model and reason about uncertainty. For instance, if a robot's sensors are noisy, the robot can model the sensor readings as random variables with known probability distributions. The robot can then use these distributions to make probabilistic inferences about its state and the state of the environment.

By using probabilistic methods, a robot can quantify its uncertainty and make decisions that account for this uncertainty. For example, if a robot is uncertain about its position, it can choose actions that reduce this uncertainty. Similarly, if a robot is uncertain about the state of the environment, it can choose actions that are likely to succeed under a range of possible environmental states.

Several probabilistic algorithms and techniques have become particularly important in the field of robotics:

State Representation Using Markov Random Fields and Hidden Markov Models

In robotics, understanding and accurately representing the state of a system is a crucial task. To handle the complexities and uncertainties inherent in real-world environments, we often turn to probabilistic graphical models such as Markov Random Fields (MRFs) and Hidden Markov Models (HMMs).

Markov Random Fields (MRFs)

An MRF captures the local dependencies among variables in a system. In state estimation, it can represent the relationships among different components of a robot's state or between the robot and its environment. 

In a robot localization problem, the robot's position at different time steps can be represented as an MRF, with each node in the graph corresponding to the robot's position at a given time step and adjacent positions connected by edges. The robot's position at a particular time step depends only on its position at the previous time step, a property that can be leveraged to efficiently estimate the robot's trajectory. 

Hidden Markov Models (HMMs)

An HMM is a closely related graphical model in which some variables (the state of the system) are hidden or unobservable, and we can only observe the outcomes influenced by these hidden variables. This model is particularly useful when the system's state cannot be directly observed.

For instance, in a robot navigation problem, the robot may not be able to directly observe its position due to sensor limitations or environmental obscurities. However, it can observe landmarks or features in the environment, which are influenced by its position. An HMM can effectively model this situation: the hidden states correspond to the robot's actual position, while the observations represent the detected landmarks or features. Given the observed landmarks and known transition probabilities between positions, the robot can use the HMM to estimate its most probable location. 

Let's Illustrate Monte Carlo Localization (MCL) with a 1D Robot Localization Example:

Monte Carlo Localization (MCL), sometimes known as particle filter localization, is a method used to estimate the position and orientation of a robot in a known environment based on a map and sensor measurements. It is a probabilistic method that utilizes a set of particles to represent the possible positions of the robot. 

The basic idea behind Monte Carlo Localization is as follows:

1. Initialization: spread a set of particles over the map to represent the initial uncertainty about the robot's position.
2. Motion update (prediction): move each particle according to the robot's motion, adding noise to model uncertainty in the movement.
3. Measurement update (correction): weight each particle by how well it explains the current sensor measurement.
4. Resampling: draw a new set of particles in proportion to these weights, concentrating particles in the likely regions.

Let's visualize this step-by-step using a simple 1D example.

Step 1: Initialization

We'll create a simple 1D world of size 100 units and initialize 1000 particles randomly in this world.
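Here's a minimal sketch of this step, assuming Python with NumPy; the same sketch is continued in the steps below.

    import numpy as np

    rng = np.random.default_rng(8)

    world_size = 100.0
    n_particles = 1000

    # Step 1: spread particles uniformly over the 1D world.
    particles = rng.uniform(0.0, world_size, size=n_particles)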

The plot above shows the initial distribution of particles in our 1D world. Each dot represents a particle, and they are uniformly distributed along the entire length of the world. 

Step 2: Motion Update (Prediction)

Let's simulate the robot's movement. We'll assume the robot moves to the right by a certain distance, but there's some noise in the motion. We'll update the position of each particle based on this motion, adding a bit of noise to simulate uncertainty in the robot's movement.

We'll move the robot (and particles) by 20 units to the right, with a motion noise of 2 units.
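Continuing the sketch from Step 1:

    # Step 2: motion update -- shift every particle by the commanded motion,
    # plus Gaussian noise to model uncertainty in the robot's movement.
    motion, motion_noise = 20.0, 2.0
    particles = particles + motion + rng.normal(0.0, motion_noise, n_particles)
    # (Particles that drift past the world boundary simply receive negligible
    # weight in the next step, so we leave them unclipped in this sketch.)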

The plot shows the distribution of particles after the motion update. As expected, the particles have shifted to the right, reflecting the robot's motion. The added noise creates some spread in the particles, simulating the uncertainty in the robot's movement.

Step 3: Measurement Update (Correction)

Let's assume the robot has a sensor that measures its distance to a landmark located at position 50 in the 1D world. This measurement will also have some noise. After taking the measurement, we will assign a weight to each particle based on how well its position matches the expected measurement.

We'll use a Gaussian function to compute the weight of each particle:

w_i = exp(-1/2 * ((Z_expected − Z_observed) / σ)^2)

where:

- w_i is the weight of particle i,
- Z_expected is the distance to the landmark predicted from the particle's position,
- Z_observed is the distance actually measured by the robot's sensor, and
- σ is the measurement noise (standard deviation).

Let's assume the robot observes a distance of 30 units to the landmark, with a measurement noise (standard deviation) of 2 units.
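Continuing the sketch, the weight computation looks like this:

    # Step 3: measurement update -- weight each particle by how well the
    # distance it predicts to the landmark matches the observed distance.
    landmark, z_observed, sigma = 50.0, 30.0, 2.0
    z_expected = np.abs(particles - landmark)  # distance each particle predicts
    weights = np.exp(-0.5 * ((z_expected - z_observed) / sigma) ** 2)
    weights /= weights.sum()                   # normalize into a distribution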


The plot above shows the particle weights after the measurement update. Particles that are closer to the expected position (given the observed measurement) have higher weights (shown in yellow), while those farther away have lower weights (shown in purple).

Step 4: Resampling

Based on the weights assigned in the measurement update, we'll now perform the resampling step. This will result in particles with higher weights being duplicated (i.e., they will appear more times in the new set of particles), while particles with lower weights might disappear.

Resampling can be done in various ways, but one common method is the "resampling wheel" algorithm. Let's resample our particles using this method.
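Continuing the sketch, here is one way to implement the resampling wheel:

    # Step 4: resampling wheel -- draw a new particle set, favoring high weights.
    new_particles = np.empty(n_particles)
    index = rng.integers(0, n_particles)
    beta = 0.0
    max_w = weights.max()
    for i in range(n_particles):
        beta += rng.uniform(0.0, 2.0 * max_w)
        while beta > weights[index]:
            beta -= weights[index]
            index = (index + 1) % n_particles
        new_particles[i] = particles[index]
    particles = new_particles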

The plot above shows the particle distribution after the resampling step. As you can see, particles that had higher weights (i.e., they were closer to the expected position given the observed measurement) are now more numerous in the distribution. Conversely, particles with lower weights are less prevalent or have disappeared.

The clusters you see around positions 20 and 80 can be explained based on the motion and measurement updates.

Both positions 20 and 80 are consistent with the robot's sensor measurement, and hence you see clusters of particles around these positions after the resampling step.

The particles at position 20 represent the possibility that the robot moved from the beginning of our world (0) directly towards the landmark, while the particles at position 80 represent the possibility that the robot was initially near the landmark and then moved away from it.

This resampling process essentially refines our estimate, making the particle cloud more concentrated around the most likely positions of the robot.

Bayesian Principles in Monte Carlo Localization

The entire process of Monte Carlo Localization is a manifestation of Bayesian reasoning:

- The particle set before a measurement represents the prior belief over the robot's state.
- The measurement update computes, for each particle, the likelihood of the observation given that particle's state.
- Resampling yields a particle set that approximates the posterior belief, which then serves as the prior for the next cycle.

By repeatedly applying these steps, MCL allows robots to refine their position estimates as they move and sense the environment, all grounded in probabilistic and Bayesian principles.

That's a basic walkthrough of the Monte Carlo Localization process in a 1D world. In real-world scenarios, this would be extended to 2D or 3D, but the principles remain the same.

Finally, Let's Provide a 2D Robot Localization Example

Description of What Happened:

In the provided example, the robot's motion was designed as a rotation of the initial motion vector [10, 10] in increments of π/8 radians (22.5°) per iteration. This creates a pattern where the robot moves through a series of rotated motion vectors, giving the appearance of a spiral or circular trajectory as the iterations progress.

The repeated application of motion updates, measurement updates, and resampling allows the robot to refine its position estimate iteratively, despite uncertainties in motion and measurements. Over time, as the process iterates, the cloud of particles will converge to the true position of the robot, providing an accurate probabilistic estimate of its location in the environment. 

Modern Methods in Robotics: From SLAM to Deep Learning


In the field of robotics, especially in the context of state estimation, localization, and mapping, various modern methods have been developed that build upon and extend the basic principles of probabilistic graphical models like MRFs and HMMs. Some of the modern methods widely adopted are:

- Simultaneous Localization and Mapping (SLAM), which jointly estimates the robot's pose and a map of the environment (e.g., EKF-SLAM, particle-filter-based FastSLAM, and graph-based SLAM).
- Deep learning methods, which learn perception and state-estimation components directly from data.
- Reinforcement learning methods, which learn control policies through trial-and-error interaction with the environment.

It's important to note that while these methods have shown impressive results, they are not without their challenges. For instance, deep learning methods require a lot of data and computational resources, and they often lack the interpretability of traditional probabilistic methods. On the other hand, reinforcement learning methods can require a large number of trials to learn a policy, which can be a challenge in real-world robotics applications where safety is a concern.


Summary and Conclusion

In conclusion, the field of Robotics heavily relies on probabilistic methods, detection and estimation theory, and graphical models to tackle the challenges posed by uncertainty in real-world applications. From perception and decision making to control and interaction, these concepts serve as fundamental tools for designing robust and efficient robotic systems.

Probabilistic methods provide a principled way to model and reason about uncertainty, allowing robots to make informed decisions in complex and uncertain environments. The concepts of random processes and probabilistic reasoning enable robots to quantify uncertainty and choose actions that optimize performance under varying conditions.

Detection and estimation theory play a vital role in signal processing and data analysis, allowing robots to detect the presence of specific signals and estimate their parameters accurately. The ability to detect signals and estimate their parameters is crucial for a wide range of applications, including localization, mapping, and object recognition.

Graphical models, such as Bayesian networks, Markov random fields, and Hidden Markov Models, provide powerful frameworks for modeling dependencies and relationships among variables in robotic systems. They enable robots to represent complex interactions and make predictions based on observed data.

Through the combination of these essential concepts, engineers and researchers can develop sophisticated, autonomous robotic systems capable of navigating uncertain environments, making accurate decisions, and interacting intelligently with the world around them. By harnessing the principles of probability and statistics, robotics continues to advance and transform industries, improving the quality of life and driving innovation across various domains.