# Fundamental Concepts in Probability, Statistics, and Estimation Theory for Engineers

## Introduction

Engineering is a field that often grapples with uncertainty. Whether it's determining the probability of system failure or estimate the parameters of a dynamical system such as self driving vehicles, engineers routinely leverage concepts from probability and statistics. This article provides an overview of some key concepts that are central to engineering applications. At the end of this article, the learner should understand the plot on the right, which is a vehical localization algorithm through probalistic models based on probability, random processes, and statistical inferences.

## The Foundation: Probability Models and Axioms

Probability models and axioms form the backbone of our understanding of uncertainty and random events. There are three fundamental axioms of probability:

The probability of any event is greater than or equal to zero.

The probability of all possible events within the sample space sums to 1.

If two events are mutually exclusive (i.e., they cannot occur together), the probability of both events happening is zero.

These axioms are the starting point for constructing more complex probabilistic models.

## Probability Space and Random Variables

A probability space, integral to modeling random events, consists of a sample space (the set of possible outcomes), a set of events, and a probability measure that assigns probabilities between 0 and 1 to these events. Engineers represent the outcomes of random phenomena using random variables. The distribution of a random variable, such as Uniform, Normal, Poisson, or Laplace, outlines the probabilities of diverse outcomes and can model various types of uncertainty.

Uniform Distribution: The uniform distribution, as depicted in the first plot, describes a situation where each outcome in the sample space is equally likely to occur. It's characterized by two parameters: "a" and "b", which define the minimum and maximum values, respectively. The probability density function (PDF) for a continuous uniform distribution is given as "1/(b-a)" for "a <= x <= b". In the field of robotics, this could model an event where a robot is equally likely to be at any position within a specified range.

Normal Distribution: The normal (or Gaussian) distribution is a bell-shaped curve that is used to represent a set of data where the majority of the data points cluster around the mean, and the rest taper off symmetrically towards either end. It's defined by two parameters: the mean ("mu") and the standard deviation ("sigma"). The PDF is given as "1/(sqrt(2pisigma^2)) * exp(-(x-mu)^2 / (2*sigma^2))". In robotics, sensor noise is often assumed to follow a normal distribution, reflecting the idea that the most likely measurement is the true value, with a symmetrical likelihood of error in either direction.

Laplace Distribution: The Laplace distribution, also known as the double exponential distribution, is similar to the normal distribution but has heavier tails. It is defined by two parameters: the mean ("mu") and the diversity ("b"). The PDF is given as "1/(2*b) * exp(-abs(x-mu)/b)". The Laplace distribution can model situations where extreme events are more likely than they would be under a normal distribution. In robotics, this might be used to model sudden and drastic changes in sensor readings, such as when a robot encounters an obstacle.

Poisson Distribution: The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space. It's defined by a single parameter ("lambda"), which is the average rate of occurrence. The probability mass function (PMF) is given as "(lambda^k * exp(-lambda)) / k!". In robotics, a Poisson distribution might model the arrival of signals to a sensor over a fixed period of time, where signals arrive independently and at a constant average rate.

## Independence and Mutual Exclusivity.

In the world of probability, events can be independent, mutually exclusive, or neither. Two events are independent if the occurrence of one does not affect the probability of the occurrence of the other. They are mutually exclusive if they cannot both occur at the same time. Let's discuss each in detail:

Independence: Two events,A and B, are said to be independent if the occurrence of A does not affect the probability of the occurrence of B, and vice versa. In other words, knowing that A has occurred does not change our belief about the likelihood of B occurring. This is mathematically expressed as

P(A∩B)=P(A)P(B),

where P(A∩B) represents the probability of both A and B occurring, and P(A) and P(B) represent the probabilities of A and B occurring, respectively. In the context of robotics, sensor readings might be treated as independent if we assume that the readings from different sensors do not influence each other.

Mutual Exclusivity: Two events, A and B, are mutually exclusive (or disjoint) if they cannot both occur at the same time. If A occurs, then B cannot occur, and vice versa. This is mathematically expressed asP(A∩B)=0, which states that the probability of both

A and B occurring is zero. For example, in a robot navigation task, the events "robot is at position X" and "robot is at position Y" would be mutually exclusive if positions X and Y are different.

Understanding the difference between these concepts is crucial as it can greatly affect how we calculate combined probabilities and model systems. A common mistake is to confuse independence with mutual exclusivity. While both concepts involve a degree of 'separation' between events, they imply very different things about the relationships between those events.

## Joint, Marginal, and conditional Probability Mass Function

In probability theory, joint, marginal, and conditional probability mass functions (PMFs) are essential concepts for understanding the relationships between multiple random variables

Joint Probability Mass Function

The joint probability mass function describes the probabilities of multiple random variables taking on specific values together. In the context of the article, it helps us understand the combined behavior of multiple variables, such as the simultaneous positions, orientations, or speeds of a robot in robotics applications. By knowing the joint probabilities, we can analyze the overall behavior and interactions of the system's variables.

The joint probability mass function is denoted as:

P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)

This represents the probability that a set of n random variables X_1, X_2, ..., X_n takes on specific values x_1, x_2, ..., x_n simultaneously.

Marginal Probability Mass Function

The marginal probability mass function focuses on the probabilities of individual random variables within the joint distribution. It allows us to study the behavior of each variable independently, regardless of the others. In the article's context, this is crucial for estimating the state of a single variable, like a robot's position, while disregarding the other variables. Marginal probabilities are useful in decision-making processes that involve single-variable analysis.

The marginal probability mass function for a specific random variable X_k is denoted as:

P(X_k = x_k)

This represents the probability that the random variable X_k takes on a specific value x_k, regardless of the values of the other random variables in the collection X_1, X_2, ..., X_n.

Conditional Probability Mass Function

The conditional probability mass function represents the probability of one random variable given the value of another random variable. In the context of the article, it is particularly valuable in understanding the relationship and dependencies between different variables. For example, in robotics, it can be used to estimate a robot's position (hidden variable) given sensor measurements (observable variable) by using techniques like Hidden Markov Models (HMMs) or Kalman filters. Conditional probabilities help us make informed decisions based on observed data and understand how the variables interact with each other.

The conditional probability mass function for two random variables X_k and X_j is denoted as:

P(X_k = x_k | X_j = x_j)

This represents the probability that the random variable X_k takes on a specific value x_k, given that the random variable X_j has taken on a specific value x_j. It allows us to study the relationship between the two random variables by conditioning one variable on the other.

Usefulness for Estimation and Detection:

Estimation: Probability mass functions play a critical role in estimation problems, where the goal is to infer unknown parameters or states based on observed data. The joint probability mass function allows us to model the joint distribution of multiple variables, aiding in estimating the state of the entire system. Marginal probabilities are useful for estimating the state of individual variables, while conditional probabilities help us refine these estimates based on additional information, leading to more accurate state estimation.

Detection: In detection problems, we aim to make decisions about the presence or absence of specific signals or patterns in data. Probability mass functions provide the basis for making these decisions probabilistically. By comparing the likelihoods of different hypotheses using the joint or marginal probabilities, we can determine the most probable scenario. Conditional probabilities can be utilized in cases where the presence of one signal affects the probabilities of other signals, leading to improved detection performance.

In summary, the concepts of joint, marginal, and conditional probability mass functions are fundamental in understanding and modeling the behavior of multiple random variables. In detection and estimation theory, they enable us to make informed decisions, estimate unknown parameters, and infer the state of systems with uncertainty. By using these probabilistic tools, engineers can design robust and efficient solutions for various real-world applications, including robotics, communication systems, image processing, and more.

## Detection and Estimation Theory

This theory forms the bedrock of decision making in signal processing and data analysis. It includes two key components: detection, which decides whether a specific signal is present, and estimation, which discerns the values of the signal parameters.

### Terms:

Random Vector: In the context of detection and estimation theory, we often deal with random vectors. A random vector is a set of random variables collected together in a vector. Each random variable within the vector represents a different dimension of the data.

Estimator and Estimate: An estimate is our best guess for the value of a random vector we want to find. An estimator, on the other hand, is a function that allows us to make this guess based on the data we have.

Unbiasedness: an unbiased estimator is the one that expected value of the estimator equals the true parameter value. This means that the estimate for the parameter ithe most accurate one, meaning the estimated parameter close to the true parameter of the underlysing obersevation distribution that we are trying to find.

Minimal variance: an unbiased estimator does not mean it's produce the best bestimate, becauase the estimated parameter can have large variation. So for a estimator to be the best among a class of unbiased estimators, the estimator that has lowest variance is the best estimators.

Minimum Variance Unbiased Estimator (MVUE): It is an unbiased estimator that has the smallest possible variance among all unbiased estimators of the parameter. If an MVUE exists, it's often a preferred choice as it offers both centered accuracy (due to unbiasedness) and precision (due to minimal variance). For example Maximum Likelyhood Estimator is MVUE for a a sequence of I.I.D possion distributed random variables with the underlying distribution parameter > 0.

Estimation of Covariance Matrices: When we're dealing with multivariate data, we often want to estimate the parameters of the covariance matrix. The covariance matrix provides us with important information about the relationships between different dimensions of the data.

Linear and Nonlinear Models: In a linear model, the relationship between the predictor variables and the parameters of the model is linear. If the relationship is not linear, we have a nonlinear model. Estimation methods can vary depending on whether the model is linear or nonlinear. For linear models, methods like linear regression or the basic Kalman filter can be used. For nonlinear models, methods like polynomial regression, logistic regression, or nonlinear versions of the Kalman filter may be used.

### Detection Theory:

The primary goal in detection theory is to decide between two or more hypotheses based on observed data. In the context of communication, for instance, this could involve deciding whether a received signal contains a specific message or just noise.

Key Concepts in Detection Theory:

Hypothesis Testing: This involves testing a null hypothesis (H0) against an alternative hypothesis (H1)

Likelihood Ratio Test (LRT): A fundamental test where the likelihood of the received data under one hypothesis is compared to the likelihood under another hypothesis.

False Alarm and Missed Detection: A false alarm occurs when a signal is detected, but it's actually noise. A missed detection is when a signal is present, but it's interpreted as noise.

### Estimation Theory:

While detection is about deciding between discrete hypotheses, estimation focuses on determining the value of a continuous parameter or a vector of parameters based on observed data.

Key Concepts in Estimation Theory:

Point Estimation: This provides a single "best guess" for the value of the parameter of interest.

Example point estimation would be using MAP

Bayesian Estimation: Incorporates prior knowledge about the parameter in the form of a probability distribution.

Mean Square Error (MSE): A common measure of the quality of an estimator. Lower MSE indicates a better estimator.

Cramér-Rao Lower Bound (CRLB): Provides a lower bound on the variance of any unbiased estimator.

### The Process of Signal Detection and Estimation

In the context of signal processing and data analysis, detection is often the first step, where you decide whether a specific signal is present in the data. Once the presence of the signal is confirmed, the next step is to estimate the parameters of the signal.

Signal Detection and Parameter Estimation

The first step is to detect whether a specific signal is present in the observed data. This typically involves comparing the data to a known pattern (or model) of the signal and deciding whether there is a match. The decision can be made based on a threshold or statistical test..

Parameter Estimation

Once the presence of the signal is confirmed, the next step is to estimate the parameters of the signal. This involves fitting the known pattern (or model) to the observed data in a way that minimizes the difference between the model and the data. Techniques like Maximum Likelihood Estimation (MLE) and Maximum A Posteriori Estimation (MAP) are commonly used for parameter estimation.

Let's consider a simple example. Suppose you are monitoring a communication channel for a specific signal. The signal, if present, is known to follow a certain pattern (or model), but the exact parameters of the pattern (e.g., frequency, amplitude) are unknown.

The first step is to detect whether the signal is present. This typically involves comparing the observed data to the expected pattern and deciding whether there is a match. This decision could be based on a threshold or some statistical test.

Once the signal is detected, the next step is to estimate its parameters. This involves fitting the known pattern (or model) to the observed data in such a way that the difference between the model and the data is minimized. This could be done using various methods, such as least squares, maximum likelihood, or Bayesian methods, depending on the nature of the signal and the noise in the data.

So, in summary, detection and estimation are two key steps in signal processing and data analysis, and they often go hand in hand. The exact methods used for detection and estimation depend on the specifics of the problem, including the nature of the signal, the noise, and the available data.

### Linear Estimation Techniques

Linear estimation involves finding the best linear function that approximates the relationship between the data and the parameters we want to estimate. One common technique in linear estimation is the Linear Minimum Mean-Squared-Error Estimator (also known as the Wiener filter). This technique minimizes the mean-squared error between the true values and the estimates, which makes it optimal under certain conditions.

Mathematically, a linear relationship between data y and parameters x can be represented as:

y=ax+b

where a and b are constants.

However, a nonlinear relationship might look something like:

y=ae^bx or y =a/(1+be^−cx)

where a, b, and c are constants, and e is the base of the natural logarithm.

When such nonlinear relationships exist, linear estimation methods like the basic Kalman filter or linear regression are not adequate. We need to use nonlinear estimation techniques, such as Maximum Likelihood Estimation (MLE), Maximum A Posteriori (MAP) estimation, or non-linear versions of the Kalman filter like the Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF). These techniques are designed to handle the complexities of nonlinear relationships and provide more accurate estimates.

### Nonlinear Estimation Techniques and Reasons

When the relationship between the data and the parameters we want to estimate is non-linear, we often use non-linear estimation techniques such as Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) estimation.

Maximum Likelihood Estimation (MLE): This method estimates the parameters of a statistical model by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. The point in the parameter space that maximizes the likelihood function is called the maximum likelihood estimate.

Maximum A Posteriori Estimation (MAP): This method is a Bayesian-based approach to estimating a distribution that takes into account both the prior distribution (prior beliefs about the parameter) and the likelihood of the observed data. The MAP estimate of a parameter is the mode of the posterior distribution of the parameter.

Both methods are commonly used in machine learning and statistics for parameter estimation.

Explanation of Nonlinear Relationships:

The idea here is to understand the difference between linear and nonlinear methods in the context of parameter estimation, and how optimization comes into play.

When we talk about linear methods, we mean that the parameters we are trying to estimate enter into the model in a linear way. For instance, in a simple linear regression model, the relationship between the predictor variables and the outcome is assumed to be linear, and the parameters (the regression coefficients) can be estimated using linear algebra (specifically, matrix operations).

In mathematical terms, if we denote our predictors by X, our outcome by y, and our parameters (coefficients) by β, the forward model is:

y=Xβ+ϵ(thenoise term)

And the parameter β can be estimated as:

β_hat = (X′X)^−1X′y

β_hat = (X′X)^−1X′y represents the solution for β_hat in ordinary least squares (OLS) regression, where:

X is the matrix of predictors (also known as independent variables or features).

y is the vector of the response variable (also known as the dependent variable or target)

β is the vector of estimated parameters.

X′ denotes the transpose of X

(X′X)^−1 denotes the inverse of X′X and It assumes that X′X is invertibe.

This formula provides the estimated parameters that minimize the sum of the squared residuals, which are the differences between the observed and predicted values of the response variable. Note, ordinary least squares (OLS) method can be sensitive to the noise term, e.

However, in many cases, the relationship between the predictors and the outcome is not linear, or the model itself may be nonlinear in the parameters. In such cases, we cannot use linear algebra to solve for the parameters directly, and instead must use an optimization approach.

In this case, we define a cost function (like the Mean Squared Error (MSE)) and use an optimization algorithm to find the parameters that minimize this cost. This is often done using iterative methods like gradient descent. The optimization problem can be written as:

θ_hat = argminθMSE(y,f(X;θ))

Here, f(X;θ) is our nonlinear model, θ are the parameters of the model, and _hat is the estimated value of the parameters that minimizes the MSE. The

argmin operation means "find the argument that minimizes", and in this context, it means "find the parameters that minimize the MSE".

So, in summary, whether we use linear algebra or optimization to estimate our parameters depends on whether our model and the relationship between our predictors and outcome is linear or nonlinear.

Examples on Nonliear Estimation Methods

In the next section, we'll illustrate the use of a Kalman filter to estimate a signal that has been corrupted by Gaussian noise

The plot will present three separate lines: the true signal, the measured signal, and the estimated signal.

True Signal: This is a constant signal represented by a straight line. In our example, the true signal is an array of ones, so the line will be flat at a height of one unit.

Measured Signal: This is the true signal corrupted by Gaussian noise. Gaussian noise is a type of statistical noise having a probability density function equal to the normal distribution, which is also known as the Gaussian distribution. This will be represented as a wavy line around the true signal, indicating the random variations introduced by the noise.

Estimated Signal: This is the result of applying the Kalman filter to the measured signal. The Kalman filter is an algorithm that uses a series of measurements observed over time (in this case, the measured signal) and produces estimates of unknown variables (in this case, the true signal) that tend to be more precise than those based on a single measurement alone. The estimated signal will be a line that follows the general trend of the measured signal, but is smoother due to the noise reduction performed by the Kalman filter.

in conclusion, detection and estimation theory are essential components of signal processing and data analysis. They enable us to make informed decisions based on observed data, whether it's detecting the presence of specific signals or estimating parameters in statistical models. These techniques find wide application in various fields, including robotics, communications, and image processing, providing valuable insights from noisy and uncertain data.

## Bayesian Inference and Estimation Techniques

Bayes' theorem is a fundamental principle in probability theory and statistics that allows us to update the probability of a hypothesis based on new evidence. In Bayesian inference, we use Bayes' theorem to calculate the updated probability of a hypothesis (H) given some observed evidence (E):

Bayes' Theorem:

P(H|E) = P(E|H) * P(H) / P(E)

Where:

P(H|E): Updated probability of the hypothesis given the evidence (posterior probability).

P(E|H): Likelihood of observing the evidence given the hypothesis.

P(H): Prior probability of the hypothesis (initial belief before considering evidence).

P(E): Probability of observing the evidence.

In Bayesian inference, we use Bayes' theorem to calculate the updated probability of a hypothesis (H) given some observed evidence (E). Bayesian estimation techniques, such as Maximum A Posteriori (MAP) and Markov Chain Monte Carlo (MCMC), utilize Bayes' theorem to update beliefs about unknown parameters or states as more data is gathered.

By incorporating prior knowledge and continuously updating estimates as new data arrives, Bayesian inference and estimation techniques are invaluable for various applications, including robotics, statistical modeling, machine learning, and data analysis.

Here's how to interpret the plot:

Prior: The prior distribution now represents our belief about the robot's location based on a previous location reading. We've modeled this belief as a Gaussian distribution centered around the previous reading, with a standard deviation representing the uncertainty in the previous reading.

Sensor Reading: The sensor reading (indicated by the red dashed line) is a noisy measurement of the robot's location. We simulated this reading by adding Gaussian noise to the true location of the robot.

Likelihood: The likelihood function (the dotted line) represents the probability of the sensor reading given each possible location. It's a probability distribution and one can see that It's highest at the sensor reading and falls off as we move away from that reading.

The likelihood function represents the probability of obtaining the observed data given the parameters of a statistical model. In the context of our robot localization example, the likelihood function is the probability density function of the Gaussian distribution that models our sensor readings. Given a set of locations (0 -100 in our case) and a sensor reading, the likelihood of the sensor reading for each possible location is calculated using the formula for the probability density function (PDF) of a normal (Gaussian) distribution:

f(x | mu, sigma^2) = 1 / sqrt(2 * pi * sigma^2) * exp( - (x - mu)^2 / (2 * sigma^2) )

where:

x is the location,

mu is the mean of the distribution (which is the sensor reading in our case),

sigma^2 is the variance of the distribution (which is the square of the standard deviation of the sensor noise in our case).

Posterior: The posterior distribution (the solid line) represents our updated belief about the robot's location after considering the sensor reading. We calculated it by multiplying the prior and the likelihood at each location (according to Bayes' rule) and then normalizing so that the total probability adds up to 1. The posterior is highest at the location that best balances the information from the prior and the likelihood.

In this example, the prior and the sensor reading are not very close to each other, indicating that the previous location reading and the new sensor reading don't agree perfectly. Despite this disagreement, the Bayesian inference process still enables us to update our belief in a principled way. As we receive more sensor readings, we can continually update the prior and converge on a more accurate estimate of the robot's location.

## Graphical Models

Bayesian networks, Markov random fields, and Hidden Markov Models are all types of graphical models. They are used to represent a set of variables and their conditional dependencies through a directed acyclic graph (DAG). For a set of random variables, a graphical model attempts to capture certain aspects of their joint distribution.

Directed Graphs

A directed graph (also called a "digraph") is a set of vertices and a set of directed edges. Each directed edge has an initial vertex, called its tail, and a terminal vertex, called its head. The directed edge is said to point or go from the tail to the head.

A simple example of a directed graph with three nodes could be represented as follows:

A → B → C

In this graph, we have an edge from A to B and an edge from B to C. This implies that B depends on A, and C depends on B. There's no direct dependence of C on A, or B on C, or A on B.

Undirected Graphs

An undirected graph is a graph in which edges have no orientation. The edge (x,y) is identical to the edge (y,x), i.e., they are not ordered pairs, but sets{x,y} of vertices.

A simple example of an undirected graph with three nodes could be represented as follows:

A ─ B ─ C

In this graph, we have an edge between

A and B and an edge between B and C. Unlike the directed graph, this implies a mutual relationship between A and B and between B and C. This could be interpreted as B being related to A, and C being related to B, but it doesn't imply a direction of influence or dependence as in the directed graph.

In the context of graphical models, directed graphs are often used for causal relationships, as in Bayesian networks, where the direction of the arrow can be interpreted as a cause-and-effect relationship. Undirected graphs are often used to represent relationships of mutual influence, as in Markov networks, where the absence of arrows indicates a bidirectional relationship.

Markov Conditions

In a graphical model, the Markov conditions are a set of conditions that a probability distribution must satisfy in order to be Markov with respect to a graph. These conditions are used to express the notion of conditional independence.

The three Markov conditions are:

Pairwise Markov Condition: Any two non-adjacent nodes are conditionally independent given the set of all other nodes.

Local Markov Condition: A node is conditionally independent of all other nodes given its neighbors.

Global Markov Condition: Any two subsets of nodes are conditionally independent given a separating subset.

The Markov conditions define a Markov Random Field in the undirected graph and a Bayesian Network in the directed graph.

Bayesian Networks (Bayes Net)

A Bayesian network (also known as a Bayes net) is a probabilistic graphical model that uses a directed acyclic graph (DAG) to represent a set of variables and their conditional dependencies. Each node in the graph represents a random variable, while the edges between the nodes represent the conditional dependencies.

Bayesian networks are great for demonstrating concepts of dependence and conditional independence. In a Bayesian network, if there is no path from node

A to node B, the two nodes are conditionally independent.

Here's an example of a simple Bayes Net:

Rain

/ \

/ \

Umbrella Wet Road

In this network, the state of the Umbrella or Wet Road variable depends on the "Rain" variable, and Umbrella are Wet Road independent.

Let's expand on the concept of dependece and conditional independence which is essential to understand and interpretate graphic models:

Conditional Probability

In probability theory, conditional probability is a measure of the probability of an event occurring, given that another event has already occurred. If the event of interest is A and event B is known or assumed to have occurred, the conditional probability of A given B is usually written as P(A∣B).

In the context of graphical models, if there is a directed edge from variable X to variable Y, we interpret this as "the probability of Y depends on ". This can be written as P(Y∣X).

Independence

In probability theory, two events are independent if the occurrence of one does not affect the probability of occurrence of the other (equivalently, does not affect the odds). Similarly, two random variables are independent if the realization of one does not affect the probability distribution of the other.

In graphical models, if there is no edge between two variables, we say that they are independent.

Conditional Independence

Two events A and B are conditionally independent given another event C if the occurrence of A and the occurrence of B are independent events in their conditional probability distribution given C.

In terms of graphical models, if we know the state of a variable, it can render some variables independent. For example, in a Bayesian network, if we know the state of a variable X, and X has children nodes Y and Z, then Y and Z are conditionally independent given X.

Let's illustrate this with an example Bayesian network:

A

/ \

B C

Here, B andC are dependent variables because they both depend onA. However, given A, B and C become conditionally independent. This is because, once

A is known, it 'blocks' the path between B and C, rendering them independent. Mathmatically, his can be expressed as

P(B,C∣A)=P(B∣A)×P(C∣A), says that the joint probability of B (umbrella is used) and C (road is wet) given A (it rained), is equal to the product of the individual probabilities of B and C given A.

This can be interpretated that the probability that the umbrella is used and the road is wet, given that it rained, is equal to the product of the probability that the umbrella is used given that it rained and the probability that the road is wet given that it rained.

In another word, if you know that it rained, the fact that the road is wet doesn't provide any additional information about whether an umbrella was used, and vice versa. Therefore, the two events (umbrella is used, road is wet) are conditionally independent given the event that it rained.

Markov Random Fields (Undirected Graph)

A Markov Random Field (MRF), also known as a Markov network, is a model over an undirected graph. A key property of MRFs is that they obey the Markov property: the value of a particular node is conditionally independent of all other nodes, given its neighbors in the graph.

Example of a Markov Random Field

Markov Random Fields (MRFs) are typically used in computer vision and image processing. Let's consider an example where we use an MRF to model the pixels of a black and white image.

In this case, each pixel is a random variable that can take on two values: 1 (white) or 0 (black). We can create a graph where each node represents a pixel, and nodes are connected if they are neighboring pixels.

The idea is that the color of a pixel is likely to be similar to the color of its neighbors. In terms of the MRF, this means that a pixel is conditionally independent of all other pixels, given the values of its neighbors.

The nodes "Pixel 1", "Pixel 2", "Pixel 3", and "Pixel 4" represent random variables (in this case, image pixels), and the undirected edges between them represent the relationships between neighboring pixels. In this model, the value of each pixel is likely to be similar to the values of its neighboring pixels. For instance, there is an edge between "Pixel 1" and "Pixel 2", indicating that these two pixels are neighbors and therefore their values are likely to be similar.

Interpretation of the diagram:

"Pixel 1" and "Pixel 3" are conditionally independent given "Pixel 2" and "Pixel 4". This is because "Pixel 2" and "Pixel 4" separate "Pixel 1" and "Pixel 3" in the graph, and once their states are known, the states of "Pixel 1" and "Pixel 3" provide no extra information about each other.

"Pixel 2" and "Pixel 4" are conditionally independent given "Pixel 1" and "Pixel 3" for similar reasons.

"Pixel 1" and "Pixel 2" are dependent because they are directly connected in the graph, which means the state of "Pixel 1" can directly influence the state of "Pixel 2" and vice versa. The same applies for all directly connected pairs: "Pixel 2" and "Pixel 3", "Pixel 3" and "Pixel 4", "Pixel 4" and "Pixel 1".

Hidden Markov Models (Directed Graph)

A Hidden Markov Model (HMM) is a statistical model where the system being modeled is a Markov process with unobserved (hidden) states. HMMs are mainly used in the field of temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.

Example of a Hidden Markov Model

Hidden Markov Models (HMMs) are often used in time series analysis and in applications like speech recognition, natural language processing, and bioinformatics.

Consider a simple weather prediction model as an example. Suppose we are trying to predict whether it will be rainy or sunny tomorrow, but we cannot directly observe the weather. Instead, we can observe whether a person carries an umbrella or not.

In this case, the true weather condition (rainy or sunny) is the hidden state, and whether or not the person carries an umbrella is the observable state. We can model the weather condition as a Markov process, meaning that tomorrow's weather depends only on today's weather, and not on the weather of previous days.

The likelihood of the person carrying an umbrella depends on the weather condition. For example, the person is more likely to carry an umbrella if it is rainy.

The graph above represents a Hidden Markov Model (HMM) with both hidden states ("Sunny" and "Rainy") and observed states ("Umbrella" and "No Umbrella").

The top layer represents the hidden states and the bottom layer represents the observed states.

The directed edges represent the transitions between states. For example, the edge from "Sunny" to "Rainy" with a weight of 0.2 represents the probability of transitioning from a sunny day to a rainy day.

The edges from the hidden states to the observed states represent the relationship between the weather conditions and the likelihood of observing someone with or without an umbrella. For instance, the edge from "Rainy" to "Umbrella" with a weight of 0.8 signifies that there is an 80% chance of observing someone with an umbrella given that the day is rainy.

Graphcal models, such as Bayesian networks, Markov random fields, and Hidden Markov Models, are mathematical structures used to model the dependencies among a set of random variables. They are essential tools in machine learning and are used in many engineering applications, including speech recognition, image processing, and bioinformatics.

## Random Processes

A random process, also known as a stochastic process, is a collection of random variables indexed by time or space. The random variables represent the outcomes of the process at different points in time or space. For example, the daily temperature in a location, the stock market prices over time, or the signal received by a sensor are all examples of random processes.

For a given random process, the distribution of outcomes can change over time or space. This is because the underlying conditions that govern the process can change. For instance, in a weather system, the distribution of possible temperatures can change from day to night or from season to season. Similarly, in a stock market, the distribution of price changes can vary based on market conditions, news events, etc.

A random process is said to be stationary if its statistical properties do not change over time. This means that the distribution of outcomes is the same at all points in time. For example, if the daily temperature is equally likely to be any value between 0 and 100 degrees every day of the year, then this would be a stationary process. However, most real-world processes are not stationary because the conditions that drive the process change over time.

There are also processes that are wide-sense stationary (WSS) or weakly stationary, which means that the mean and variance of the process do not change over time, and the autocorrelation between two time points depends only on the difference between the time points, not the actual time at which the correlation is computed. This is a weaker condition than strict stationarity.

On the other hand, ergodic processes are those where the time averages for a single realization converge to the ensemble averages. In other words, given a single, infinitely long realization of the process, we could compute the process's statistical properties accurately.

Note: When the time averages and ensemble averages are the same means the process behaves the same way on average whether we look at one realization over time or many realizations at one moment. This could be interpreted as a kind of repeatability: the process repeats its overall behavior in each realization. This property allows us to make inferences about the overall behavior of the process based on a single, sufficiently long realization.

Let's examined the details:

Time Average: The time average of a random process at a particular point is calculated by observing the process at that point over a period of time. The average of the observed values gives the time average. Essentially, the time average is the average value of the process at a specific point over time.

Ensemble Average: The ensemble average, on the other hand, is calculated across an ensemble of different realizations of the random process at a specific moment in time. Imagine you could replicate the process many times under the same conditions and observe all these copies at the same moment in time. The average of these observations gives the ensemble average.

The concept of ergodicity connects these two types of averages. A random process is said to be ergodic if its time averages are equal to its ensemble averages for all times and all points in space. This means that the long-term behavior of a single realization (time average) is representative of the behavior of the ensemble of realizations (ensemble average). This property is extremely useful because it allows us to make inferences about the overall process based on a single, sufficiently long realization.

Below is an depiction of random walk of time average vs. ensemble average:

The process demonstrated above is called a "Random Walk". It's a simple random process where each step is independent and equally likely to go up (1) or down (-1).

Regarding ergodicity, a simple symmetric random walk (like the one above) is indeed ergodic. The ergodic theorem for random walks states that, given a long enough walk, the average value of the walk (time average) will converge to the expected value (ensemble average). In our example, both the time average and the ensemble average are equal to zero, so the ergodic property holds.

This property is fundamentally important because it allows us to substitute time averages for ensemble averages. In other words, we can observe a single realization of the process for a long time to understand the behavior of the process as a whole. This is crucial in many practical situations where we only have one realization of a process (for example, a single run of an experiment or a single data set) and want to make inferences about the underlying random process.

In summary, the distribution associated with a random process can change over time because the underlying conditions of the process can change. Whether the distribution changes, and how it changes, depends on the specifics of the process and the system it represents. In any case, it's the randomness and the change over time that make a random process interesting and challenging to analyze and predict.

### Random Variables and Random Vectors in the Context of Random Processes

A random variable is generally defined as a variable whose possible values are outcomes of a random phenomenon. It is a function that maps the outcomes of a random experiment to numerical values. For example, in a single roll of a six-sided die, the outcome (which number is face-up) is a random variable. This random variable only "takes a value" when the die is rolled.

In the context of a random process (also known as a stochastic process), you can think of a different random variable being defined at each point in time. For example, in the context of weather forecasting, the temperature at 2 pm is a different random variable than the temperature at 3 pm, even though they may be related.

In a continuous random process, you could think of an uncountably infinite number of random variables: one for every instant in time. For a discrete random process, there might be a countably infinite number of random variables: one for each discrete time step.

In the context of a random process, a random vector can be used to capture the state of the process at multiple points in time or space. In short, it is simply a collection of multiple random variables. X = [X1 , X2 , ... , Xn], X is a random vector or simply put a vector of random variables. For instance, samples of an audio packets stored in a fixed size of buffer[X0, X1, X2, ..Xm] is a random vector consisteing for random variable X(n) = X(nT) (n is the sample index and T is the sampling interval.

It can represent the states of a process at different time points. For example, in observing daily temperatures, the temperatures over a week would form a random vector. Each entry in the vector corresponds to the temperature (a random variable) on a different day.

Note: A random vector is not limited to capturing states at consecutive points in time or space. It could contain the state of the process at any set of points. For example, it could contain the temperature every Monday for a year, or the temperature at noon every day for a month.

A random vector is particularly useful when the random variables it contains are not independent. In this case, the joint distribution of the variables in the vector can capture correlations between the states of the process at different times or places. This can be valuable information for understanding and predicting the process.

Random variable: Temperature on day 101: [38.83] °C

Random vector: Temperatures on week 21: [ 5.09 24.39 21.67 26.35 43.83 29.44 10.87] °C

The plot above represents a random process: daily temperatures over a year. This is a simulated process where temperature values are generated from a normal distribution, reflecting the daily variation in temperature.

A random variable in this context could be the temperature on a specific day. For instance, the temperature on the 101st day is 38.83 °C. It's a random variable because it's an outcome of a random process (daily temperature variations).

A random vector could be the temperatures over a specific week. For instance, the temperatures on the 21st week of the year are [5.09, 24.39, 21.67, 26.35, 43.83, 29.44, 10.87] °C. Each entry in this vector is the temperature on a different day, so the vector gives us a week's worth of temperature data.

### Convergence of a Sequence of Random Variables

In many applications, we are interested in the behavior of a sequence of random variables as it progresses. This leads to the concept of convergence of a sequence of random variables. There are several types of convergence, including almost sure convergence, convergence in probability, and convergence in distribution.

This plot shows an example of convergence in probability. Each point on the plot represents the mean of a random variable drawn from a uniform distribution that ranges from -1/n to 1/n. As n increases, the range of the distribution narrows, and the mean of the distribution converges to 0. This is a demonstration of the law of large numbers, which states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

### Jointly Gaussian Random Variables and Vectors

Jointly Gaussian (or multivariate Gaussian) random variables and vectors are an important class of random processes. A set of random variables is jointly Gaussian if any linear combination of them follows a Gaussian distribution. Gaussian processes are widely used in engineering, especially in signal processing and machine learning, due to their mathematical tractability and physical relevance.

The scatter plot shows an example of jointly Gaussian random variables. The equation on the plot describes the properties of the joint Gaussian distribution:

Σ=σij=E[(Xi−μi)(Xj−μj)] is the covariance matrix, which captures the covariance between each pair of variables in the distribution.

Xi∼N(μi,σi^2) means that each variable in the distribution follows a normal distribution with mean μ and variance σ^2

The red ellipse on the plot is a confidence ellipse that contains about 99.7% of the samples, assuming the samples are jointly Gaussian. It provides a visual representation of the covariance matrix: the orientation of the ellipse indicates the correlation between the variables, and the lengths of the axes of the ellipse are proportional to the standard deviations of the variables.

### Random Walks and Brownian Motion

Random walks and Brownian motion are specific types of random processes that are particularly important in fields such as finance, physics, and engineering. A random walk is a sequence of random variables where each step is determined by a random draw from a certain distribution. Brownian motion is a continuous-time random walk, often used to model the motion of particles suspended in a fluid or stochastic market processes.

This plot shows an example of a random walk. At each step, the position changes by +1 or -1 with equal probability. Over time, the position undergoes a "drunkard's walk" pattern, seemingly wandering aimlessly up and down the number line. Despite its simplicity, the random walk model has wide-ranging applications, from modeling stock prices in finance to modeling the motion of gas molecules in physics.

### Stationarity, Wide-Sense Stationarity, and Ergodicity

A random process is said to be stationary if its statistical properties do not change over time. This means that the mean, variance, and autocorrelation function of the process are time-invariant. A weaker form of stationarity process is wide-sense stationary (WSS) because only the process's mean and autocorrelation function are time-invariant. A process is ergodic if its time-averaged behavior is the same as its ensemble-averaged behavior, meaning we can learn about the process's long-term behavior by observing a single realization over a long time period.

Let's demonstrate these properties with a simple example: a sine wave with added Gaussian noise.

A sine wave is deterministic, so it's not a random process and doesn't have the properties of stationarity or ergodicity. However, if we add Gaussian noise to the sine wave, we get a random process. The resulting process is WSS because the mean and autocorrelation function do not change over time. However, it's not stationary because the distribution of the process changes over time (it moves up and down with the sine wave). Finally, it's ergodic because the time averages and ensemble averages are the same.

We'll generate a plot of the process, and compute the time average and ensemble average to show that they're the same, demonstrating ergodicity.

The plot above shows a random process that is a sine wave with added Gaussian noise. The process is wide-sense stationary because its mean (shown by the red dashed line) and autocorrelation function (not shown) do not change over time. The process is not strictly stationary because the distribution of the process changes over time (it moves up and down with the sine wave).

The process is ergodic because the time average (red dashed line) is the same as the ensemble average (green dashed line), meaning we can compute the process's statistical properties from a single, infinitely long realization of the process. In this case, the time average and ensemble average are approximately the same, which is what we expect for an ergodic process.

### Real world Random Processes

### 1. Shot Noise as Poisson Process

The first plot above shows a realization of a Poisson process, where the x-axis represents time and the y-axis represents the number of events that have occurred by each time point. The Poisson process is characterized by a parameter λ, which represents the average rate of events per unit of time. In this simulation, we used λ=10, meaning that on average, 10 events occur per unit of time. So the time average is 10 for possion process.

The second plot shows shot noise derived from the Poisson process. In this model, we assume that each "event" in the Poisson process corresponds to the passage of an electron across a barrier, which generates a noise pulse with a certain amplitude.

Shot noise arises in electronics and telecommunications due to the discrete nature of electric charge. It's called "shot noise" because the charge carriers (like electrons) that carry current across a barrier or junction do so in a manner that can be thought of as a stream of "shots" being fired.

The Poisson process is a simple and commonly used statistical model for systems where events happen sporadically and independently of each other over time. This makes it a good model for shot noise, where each "shot" (or the passage of an electron across a barrier) is an independent event.

Stationarity: The Poisson process is in fact a kind of stationary process. The inter-arrival times between events are exponentially distributed and independent, which means their distribution doesn't change over time. Thus, the underlying distribution does not change over time. The confusion might arise from the fact that the number of events in a given time interval varies, but this variation follows a fixed distribution (the Poisson distribution).

Wide-Sense Stationarity (WSS): The Poisson process is also wide-sense stationary. The mean and variance do not change over time, as they are both equal to the rate parameterλ for a Poisson distribution. The autocorrelation function also only depends on the difference between times, not the actual times, which is another requirement for WSS.

Ergodicity: Poisson process is often ergodic, especially over long periods of time or many realizations, because the time averages (mean, variance, etc.) tend to equal the ensemble averages.

### 2. Thermal Noise as Gausian Process

Thermal noise, also known as Johnson-Nyquist noise, is the electronic noise generated by the thermal agitation of the charge carriers (usually the electrons) inside an electrical conductor at equilibrium, which happens regardless of any applied voltage. The random motion of electrons due to thermal agitation results in a net fluctuation in the voltage or current, which is what we refer to as thermal noise.

Thermal noise is modeled as a Gaussian process because the movement of each electron can be considered independent of the others, and the net result of a large number of independent, random movements is a Gaussian distribution due to the central limit theorem.

The power spectral density of thermal noise is constant and is given by N0 =kT

k is Boltzmann's constant and T is the absolute temperature in Kelvin. This means that the power of the thermal noise is the same at all frequencies, which is why it's often referred to as "white noise."

Let's look at its properties.

Stationarity: A process is said to be stationary if its statistical properties do not change over time. For thermal noise, the mean and variance are constant over time, and the autocorrelation function only depends on the time difference or lag but not the actual time. Therefore, thermal noise is a stationary process.

Wide-Sense Stationarity (WSS): A process is WSS if its mean and autocorrelation function are invariant over time. Since thermal noise is stationary, it is also WSS.

Ergodicity: A process is ergodic if its time averages are the same as its ensemble averages. Because thermal noise is stationary and its statistical properties do not change over time, it is also ergodic. This means that we can estimate its statistical properties by observing it over a long period of time.

### 3. Pink Noise as Fractional Gaussian Process

Pink noise, also known as 1/f noise, is a signal or process that has a frequency spectrum such that the power spectral density (power per frequency interval) is inversely proportional to the frequency of the signal. In pink noise, each octave carries an equal amount of noise power. The name arises from the pink appearance of visible light with this spectral distribution. In terms of random processes, pink noise is a type of fractional Gaussian noise. This means it can be modeled as a Gaussian process, but with a correlation structure that leads to its 1/f power spectrum

Here are some key properties of pink noise:

Spectral Density: The defining characteristic of pink noise is that its power spectral density is inversely proportional to the frequency. This means that it has more power at lower frequencies and less power at higher frequencies.

Stationarity: Pink noise is not strictly stationary because its variance changes over time due to the long-range dependence or correlation present in the noise

Wide-Sense Stationarity (WSS): Pink noise is also wide-sense stationary. Its mean and autocorrelation function are invariant over time.

Ergodicity: Pink noise is typically ergodic, which means its time averages are the same as its ensemble averages. This is because its statistical properties do not change over time, so the long-term behavior of a single realization of the process can represent the behavior of the process as a whole.

In summary, pink noise is a common type of noise found in many natural and man-made systems. Its 1/f power spectrum, long-range dependence, and other properties make it a unique and interesting topic of study in fields like physics, engineering, and data analysis. It is also used in audio and acoustics to create balanced noise signals or to test audio equipment.

### 4. Brownian Motion/ Wierner Process

The plots above display a realization of Gaussian noise and the corresponding Brownian motion derived from that noise. In the Brownian motion plot on the middle, the position at a particular time is the cumulative sum of the Gaussian noise up to that point

Brownian motion is a type of random walk where each step is a Gaussian random variable. It's used to model a variety of phenomena, including the motion of particles in a fluid, stock prices, and more. We've already seen an example of a random walk.

You can actually think of Brownian motion as the integral of Gaussian white noise. Each tiny step the particle takes in a Brownian motion process is a sample of Gaussian noise, and the position of the particle at any point in time is the sum (or integral) of all these steps.

Brownian motion is not stationary, because its increments grow with the square root of time, so its variance increases with time. This violates the requirement for strict stationarity that all moments (including the variance) must be constant over time.

The increments of Brownian motion are wide-sense stationary, rather than the process itself. That is, if you look at the changes from one step to the next, those changes have a constant mean and autocorrelation over time.

Ergodicity is a property of stochastic processes that intuitively means the process doesn't get "stuck" in certain states and that, given enough time, it will explore all possible states. More formally, a process is ergodic if its long-term behavior can be deduced from a single, sufficiently long, random sample path. In other words, the time average equals the ensemble average. However, Brownian motion doesn't satisfy this property. While it's true that Brownian motion doesn't get "stuck" and will explore all possible states given enough time, its variance increases with time. This means that the process becomes more and more spread out as time goes on, and it never settles down to a steady state where the time average equals the ensemble average.

In short, brownmian process has infinite variance. This means that the longer we observe a single realization of the process, the more its path will wander and the more its variance will increase. As a result, the time average of a single realization doesn't stabilize to a single value, but continues to change the longer we observe it. On the other hand, the ensemble average at any given time is always zero (since Brownian motion is symmetric and has zero drift). So in this sense, the time average does not equal the ensemble average.

From a statistic perspection, the reason is that eventhough increments of Brownian motion are independent and identically distributed (i.e.,each step the particle takes is an independent random variable (the step doesn't depend on the previous steps), and all the steps have the same normal distribution)., but the process itself is not independently distributed because The position at time t is dependent on the position at time t−1 . This violated law of large numbers and central limit theorem

### Random Processes Through Linear Systems

Random processes can be passed through linear systems, resulting in new random processes. The output process depends on the input process and the characteristics of the system. This concept is fundamental in fields like signal processing, where we often want to understand the behavior of a random signal (like noise) as it passes through a filter

This plot shows an example of a random process (a sequence of random values, in this case, from a normal distribution) passing through a linear system. The linear system in this case is a moving average filter, which smooths the signal by replacing each point with the average of the surrounding points. The filtered signal, shown in orange, has less high-frequency content and appears smoother than the original signal. This illustrates how a linear system can alter the characteristics of a random process. In real-world applications, this could represent noise reduction or signal conditioning in a signal processing system.

## Probabilistic Methods in State Estimation: From Theory to Real-World Robotics Applications

### Introduction

The field of Robotics heavily relies on probabilistic algorithms for various functions, from perception and decision making to control and interaction. As robots interact with an uncertain and often unpredictable world, accurately estimating their state (position, orientation, speed, etc.) and the state of their surroundings becomes vital for efficient task execution.

### Background Concepts: Random Processes, Nonlinear State Estimation, and Probabilistic Methods

In the application of probabilistic algorithms to robotics, several key concepts form the backbone of our understanding and approach.

Random Processes

Random processes are fundamental to understanding and modeling the inherent uncertainty that exists in real-world robotics applications. This uncertainty stems from various sources, such as sensor noise, actuator errors, model inaccuracies, or an unpredictable environment. Recognizing and accounting for these random processes is vital for creating robust and reliable robotic systems.

Nonlinear State Estimation Methods

In many practical scenarios, the dynamics of a system are not linear. For instance, the movement of a robot often depends on its current position, orientation, speed, and control inputs, all of which can relate in nonlinear ways. Similarly, sensor readings can be nonlinear functions of the state of the robot and the environment.

When the system dynamics are nonlinear, linear methods like the basic Kalman filter cannot provide accurate state estimates. This necessitates the use of nonlinear methods such as the Extended Kalman Filter or the Particle Filter. These methods can handle the nonlinearities and provide more accurate state estimates, making them invaluable tools in robotics.

Probabilistic Methods

Probabilistic methods are employed to handle the inherent uncertainty in robotics. These methods use the principles of probability theory to model and reason about uncertainty. For instance, if a robot's sensors are noisy, the robot can model the sensor readings as random variables with known probability distributions. The robot can then use these distributions to make probabilistic inferences about its state and the state of the environment.

By using probabilistic methods, a robot can quantify its uncertainty and make decisions that account for this uncertainty. For example, if a robot is uncertain about its position, it can choose actions that reduce this uncertainty. Similarly, if a robot is uncertain about the state of the environment, it can choose actions that are likely to succeed under a range of possible environmental states.

### Several probabilistic algorithms and techniques have become particularly important in the field of robotics:

Kalman Filter: This algorithm is a powerful tool for estimating the state of a linear dynamic system from a series of noisy measurements. It forms the backbone of many robotics applications, including navigation, control systems, computer vision, and signal processing. For instance, in a self-driving car, a Kalman filter can combine measurements from different sensors, each with their own uncertainties and errors, to estimate the car's true position and velocity with less uncertainty.

Extended Kalman Filter (EKF): This is an extension of the Kalman filter for non-linear systems. The EKF linearizes the non-linear functions with a first-order Taylor series expansion, which allows for the application of the standard Kalman filter. A typical application is in tasks like Simultaneous Localization and Mapping (SLAM), where a robot or a drone needs to construct a map of an unknown environment while keeping track of its location within that map.

Particle Filter: This is a Sequential Monte Carlo method used to estimate the state of non-linear and non-Gaussian systems. By representing the posterior distribution of the state with a set of random samples or particles, the particle filter can approximate the true posterior. This approach is often used in robot localization problems, where the robot must estimate its pose (position and orientation) given a map of the environment and sensor readings.

Monte Carlo Localization (MCL): This is a specific application of particle filters in robotics. In MCL, a robot begins with a random guess of its position and orientation represented by particles spread across the area of interest. As the robot moves and senses its environment, it updates the weights of the particles based on their match with the sensor readings. Over time, the particles converge towards the robot's actual location.

### State Representation Using Markov Random Fields and Hidden Markov Models

In robotics, understanding and accurately representing the state of a system is a crucial task. To handle the complexities and uncertainties inherent in real-world environments, we often turn to probabilistic graphical models such as Markov Random Fields (MRFs) and Hidden Markov Models (HMMs).

Markov Random Fields (MRFs)

An MRF captures the local dependencies among variables in a system. In state estimation, it can represent the relationships among different components of a robot's state or between the robot and its environment.

In a robot localization problem, the robot's position at different time steps can be represented as an MRF, with each node in the graph corresponding to the robot's position at a given time step and adjacent positions connected by edges. The robot's position at a particular time step depends only on its position at the previous time step, a property that can be leveraged to efficiently estimate the robot's trajectory.

Hidden Markov Models (HMMs)

An HMM is a specific type of MRF where some variables (the state of the system) are hidden or unobservable, and we can only observe the outcomes influenced by these hidden variables. This model is particularly useful when the system's state cannot be directly observed.

For instance, in a robot navigation problem, the robot may not be able to directly observe its position due to sensor limitations or environmental obscurities. However, it can observe landmarks or features in the environment, which are influenced by its position. An HMM can effectively model this situation: the hidden states correspond to the robot's actual position, while the observations represent the detected landmarks or features. Given the observed landmarks and known transition probabilities between positions, the robot can use the HMM to estimate its most probable location.

Let's Illustrate Monte Carlo Localization (MCL) with an 1D Robotic localization Example:

Monte Carlo Localization (MCL), sometimes known as particle filter localization, is a method used to estimate the position and orientation of a robot in a known environment based on a map and sensor measurements. It is a probabilistic method that utilizes a set of particles to represent the possible positions of the robot.

The basic idea behind Monte Carlo Localization is as follows:

Initialization: A set of particles is distributed throughout the environment. Each particle represents a possible position and orientation of the robot.

Motion Update (Prediction): As the robot moves, each particle is moved in a way that approximates the robot's motion. However, some random noise is added to account for uncertainties in the motion.

Measurement Update (Correction): After the motion, the robot takes a sensor measurement. Each particle is then weighted based on how likely it is to produce the observed measurement given its position on the map.

Resampling: Particles with low weights are replaced with particles with high weights. This focuses the set of particles around the most likely positions of the robot.

Let's visualize this step-by-step using a simple 1D example.

### Step 1: Initialization

We'll create a simple 1D world of size 100 units and initialize 1000 particles randomly in this world.

The plot above shows the initial distribution of particles in our 1D world. Each dot represents a particle, and they are uniformly distributed along the entire length of the world.

### Step 2: Motion Update (Prediction)

Let's simulate the robot's movement. We'll assume the robot moves to the right by a certain distance, but there's some noise in the motion. We'll update the position of each particle based on this motion, adding a bit of noise to simulate uncertainty in the robot's movement.

We'll move the robot (and particles) by 20 units to the right, with a motion noise of 2 units.

The plot shows the distribution of particles after the motion update. As expected, the particles have shifted to the right, reflecting the robot's motion. The added noise creates some spread in the particles, simulating the uncertainty in the robot's movement.

### Step 3: Measurement Update (Correction)

Let's assume the robot has a sensor that measures its distance to a landmark located at position 50 in the 1D world. This measurement will also have some noise. After taking the measurement, we will assign a weight to each particle based on how well its position matches the expected measurement.

We'll use a Gaussian function to compute the weight of each particle:2

wi=e^−1/2((Zexpected−Zobserved 2)/σ )^2

where:

wi is the weight of particle i.

zexpected is the expected measurement for particle i.observed

zobserved is the actual measurement taken by the robot.

σ is the standard deviation of the measurement noise.

Let's assume the robot observes a distance of 30 units to the landmark, with a measurement noise (standard deviation) of 2 units.

The plot above shows the particle weights after the measurement update. Particles that are closer to the expected position (given the observed measurement) have higher weights (shown in yellow), while those farther away have lower weights (shown in purple).

### Step 4: Resampling

Based on the weights assigned in the measurement update, we'll now perform the resampling step. This will result in particles with higher weights being duplicated (i.e., they will appear more times in the new set of particles), while particles with lower weights might disappear.

Resampling can be done in various ways, but one common method is the "resampling wheel" algorithm. Let's resample our particles using this method.

The plot above shows the particle distribution after the resampling step. As you can see, particles that had higher weights (i.e., they were closer to the expected position given the observed measurement) are now more numerous in the distribution. Conversely, particles with lower weights are less prevalent or have disappeared.

The clusters you see around positions 20 and 80 can be explained based on the motion and measurement updates.

Motion Update: We moved the particles 20 units to the right. Thus, the particles that were initially around position 0 would now be around position 20 after the motion update.

Measurement Update: The robot measured a distance of 30 units to the landmark located at position 50. This means the robot believes it is located either 30 units to the left of the landmark (i.e., at position 20) or 30 units to the right of the landmark (i.e., at position 80).

Both positions 20 and 80 are consistent with the robot's sensor measurement, and hence you see clusters of particles around these positions after the resampling step.

The particles at position 20 represent the possibility that the robot moved from the beginning of our world (0) directly towards the landmark, while the particles at position 80 represent the possibility that the robot was initially near the landmark and then moved away from it.

This resampling process essentially refines our estimate, making the particle cloud more concentrated around the most likely positions of the robot

### Bayesian Principles in Monte Carlo Localization

The entire process of Monte Carlo Localization is a manifestation of Bayesian reasoning:

Prior belief: Before the motion update, the distribution of particles represents our prior belief about the robot's position.

Prediction: After the motion update, the distribution represents our prediction of the robot's new position given its movement (the prediction step is analogous to the prediction step in the Bayes' theorem).

Measurement likelihood: The weights assigned to each particle after the measurement update represent the likelihood of the observed measurement given the particle's position.

Posterior belief: After resampling, the distribution represents our updated belief (posterior) about the robot's position given the observed measurement.

By repeatedly applying these steps, MCL allows robots to refine their position estimates as they move and sense the environment, all grounded in probabilistic and Bayesian principles.

That's a basic walkthrough of the Monte Carlo Localization process in a 1D world. In real-world scenarios, this would be extended to 2D or 3D, but the principles remain the same.

Finally Let's priovde an 2D Robotic localization Example

The red "X" represents the landmark.

The magenta star marker in the plot represents the estimated location (average of all the particles).

Blue dots represent the particles

The blue circle around the landmark depicts the robot's noisy measured distance to the landmark.

Green arrows indicating the robot's motion direction and magnitude.

Description of What Happened:

Initial Iteration: The particles are spread out uniformly. The robot measures its distance to the landmark (blue circle).

Following Iterations: As the robot moves and senses its environment:

Particles move based on the robot's motion and some noise.

After each measurement, particles closer to the blue circle (measurement) get higher weights.

Resampling results in particles with higher weights becoming more prevalent, refining the robot's position estimate.

Final Iterations: The cloud of particles converges around the robot's actual path, indicating a more confident estimate of its position.

In the provided example, the robot's motion was designed as a rotation of the initial motion vector [10,10] in increments of π/8 radians (or 22.5∘) for each iteration. This creates a pattern where the robot moves in a series of vector rotations, giving the appearance of a spiral or circular trajectory as the iterations progress.

The repeated application of motion updates, measurement updates, and resampling allows the robot to refine its position estimate iteratively, despite uncertainties in motion and measurements. Over time, as the process iterates, the cloud of particles will converge to the true position of the robot, providing an accurate probabilistic estimate of its location in the environment.

### Modern Methods in Robotics: From SLAM to Deep Learning

In the field of robotics, especially in the context of state estimation, localization, and mapping, various modern methods have been developed that build upon and extend the basic principles of probabilistic graphical models like MRFs and HMMs. Some of the modern methods widely adopted are:

Simultaneous Localization and Mapping (SLAM): SLAM is a computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it. Various forms of SLAM exist, from feature-based SLAM to direct (dense) SLAM, with numerous algorithms developed for each.

FastSLAM: This is a variant of SLAM that uses a particle filter-based approach for the localization and a low-dimensional Extended Kalman Filter (EKF) for the map estimation. This separation makes FastSLAM computationally efficient.

GraphSLAM: GraphSLAM is a variant of SLAM that formulates the problem as a full maximum likelihood estimation. The whole trajectory of the robot and the full map are estimated globally in this method.

Visual SLAM (vSLAM): In vSLAM, the robot uses a camera (mono or stereo) as the main sensor, and it uses visual features to localize itself and build a map of the environment. Examples of vSLAM include ORB-SLAM, LSD-SLAM, and SVO.

Deep Learning-Based Methods: Recent years have seen the rise of deep learning methods in robotics. These methods use neural networks to learn complex mappings from sensor data to states or actions. For instance, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are used for visual odometry, localization, and mapping.

Reinforcement Learning-Based Methods: Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with its environment. In robotics, reinforcement learning methods can be used for tasks such as path planning and navigation.

It's important to note that while these methods have shown impressive results, they are not without their challenges. For instance, deep learning methods require a lot of data and computational resources, and they often lack the interpretability of traditional probabilistic methods. On the other hand, reinforcement learning methods can require a large number of trials to learn a policy, which can be a challenge in real-world robotics applications where safety is a concern.

## Summary and Conclusion

In conclusion, the field of Robotics heavily relies on probabilistic methods, detection and estimation theory, and graphical models to tackle the challenges posed by uncertainty in real-world applications. From perception and decision making to control and interaction, these concepts serve as fundamental tools for designing robust and efficient robotic systems.

Probabilistic methods provide a principled way to model and reason about uncertainty, allowing robots to make informed decisions in complex and uncertain environments. The concepts of random processes and probabilistic reasoning enable robots to quantify uncertainty and choose actions that optimize performance under varying conditions.

Detection and estimation theory play a vital role in signal processing and data analysis, allowing robots to detect the presence of specific signals and estimate their parameters accurately. The ability to detect signals and estimate their parameters is crucial for a wide range of applications, including localization, mapping, and object recognition.

Graphical models, such as Bayesian networks, Markov random fields, and Hidden Markov Models, provide powerful frameworks for modeling dependencies and relationships among variables in robotic systems. They enable robots to represent complex interactions and make predictions based on observed data.

Through the combination of these essential concepts, engineers and researchers can develop sophisticated, autonomous robotic systems capable of navigating uncertain environments, making accurate decisions, and interacting intelligently with the world around them. By harnessing the principles of probability and statistics, robotics continues to advance and transform industries, improving the quality of life and driving innovation across various domains.