Convexity

Convex Set

Definition: A convex set is a subset of a vector space with the property that, informally, the straight line segment drawn between any two points in the set remains entirely within the set. This property can be described as the set ‘enclosing’ all the points on that line segment.

Formal Definition: A set C is convex if, for any two points x1, x2 in C, the point λx1 + (1−λ)x2 is also contained in C for every λ in the interval [0, 1].

Remark: Think of λ as a parameter that lets us choose any point along the line segment. When λ is 1 we are at the endpoint x1, when λ is 0 we are at the endpoint x2, and intermediate values of λ sweep out every point in between. The expression λx1 + (1−λ)x2 should not be perplexing; it simply provides a systematic way to traverse each point on the line segment.
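
As a quick illustration, here is a minimal Python sketch (the two endpoints are arbitrary choices) that sweeps λ from 1 down to 0 and prints the corresponding points on the segment:

    import numpy as np

    x1 = np.array([0.0, 0.0])   # arbitrary example endpoints
    x2 = np.array([4.0, 2.0])

    for lam in [1.0, 0.75, 0.5, 0.25, 0.0]:
        point = lam * x1 + (1 - lam) * x2   # lam = 1 gives x1, lam = 0 gives x2
        print(f"lambda = {lam:4.2f} -> {point}")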

Visualization: Here we will show a visual representation of a convex set compared to a non-convex set.

The convex set (on the left) is depicted as a blue circle, and the red line segment, parameterized by λ, shows that all points between the two endpoints of the line segment are contained within the set.

The non-convex set (on the right) is depicted as a light pink donut, and the same red line segment, again parameterized by λ, demonstrates that some points of the line segment fall outside the set, indicating its non-convex nature.
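
To make the picture concrete, here is a small Python sketch (the disk, the annulus radii, and the segment endpoints are assumptions chosen to mirror the figure) that samples points along a segment and tests whether every one of them stays inside each set:

    import numpy as np

    def in_disk(p, radius=1.0):
        # Convex set: all points within distance `radius` of the origin.
        return np.linalg.norm(p) <= radius

    def in_annulus(p, r_inner=0.5, r_outer=1.0):
        # Non-convex "donut": distance from the origin lies between the two radii.
        return r_inner <= np.linalg.norm(p) <= r_outer

    def segment_stays_inside(contains, x1, x2, n=101):
        # Test lam*x1 + (1 - lam)*x2 for a grid of lam values in [0, 1].
        lams = np.linspace(0.0, 1.0, n)
        return all(contains(lam * x1 + (1 - lam) * x2) for lam in lams)

    x1, x2 = np.array([-0.8, 0.0]), np.array([0.8, 0.0])  # endpoints on opposite sides
    print(segment_stays_inside(in_disk, x1, x2))     # True: the disk is convex
    print(segment_stays_inside(in_annulus, x1, x2))  # False: the segment crosses the hole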

Convex Function

Definition: A convex function is one where the line segment connecting any two points on the graph of the function never falls below the graph between those two points. Intuitively, this means that the function ‘curves upwards’: the graph never rises above the straight line connecting any two of its points.

Formal Definition: A function f is convex if its domain is a convex set and, for any points x1, x2 in the domain and any λ ∈ [0, 1], f((1 − λ)x1 + λx2) ≤ (1 − λ)f(x1) + λf(x2)

Remark: For a function to be convex, it must satisfy two conditions. First, the domain where the function is defined must be a convex set, meaning that for any two points within the domain, the entire line segment connecting them also lies within the domain. Second, the function must ‘curve upwards’, such that any line segment connecting two points on the graph, f(x1) and f(x2), lies entirely on or above the graph of the function for all points between x1 and x2.

Remark 2: Flipping the inequality sign gives the definition of a concave function (an ‘upside down U’ shape): f((1 − λ)x1 + λx2) ≥ (1 − λ)f(x1) + λf(x2). The domain must still be a convex set, and a function f is concave exactly when −f is convex.
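
As a sanity check, the defining inequality can be probed numerically on randomly drawn pairs x1, x2 and values of λ. This is a minimal sketch (the test functions, the sampling range, and the tolerance are arbitrary choices), and a passing result is evidence of convexity rather than a proof:

    import numpy as np

    def looks_convex(f, trials=10_000, low=-5.0, high=5.0, tol=1e-9, seed=0):
        # Randomly test f((1 - lam)*x1 + lam*x2) <= (1 - lam)*f(x1) + lam*f(x2).
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x1, x2 = rng.uniform(low, high, size=2)
            lam = rng.uniform(0.0, 1.0)
            lhs = f((1 - lam) * x1 + lam * x2)
            rhs = (1 - lam) * f(x1) + lam * f(x2)
            if lhs > rhs + tol:
                return False  # found a counterexample: f is not convex on this range
        return True  # no violation found in any trial

    print(looks_convex(lambda z: z**2))  # True:  z^2 is convex
    print(looks_convex(np.sin))          # False: sin is not convex on [-5, 5]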

Visualization of Convex Function: To illustrate this, we will plot a convex function and draw the line segment between two points x1 and x2, showing that between those points the graph of the function lies on or below this line segment.
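
A plot along these lines can be produced with a short matplotlib sketch; the choice of f(x) = x² and the chord endpoints x1 = −1, x2 = 2 are assumptions made purely for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    f = lambda x: x**2                  # a convex function
    x1, x2 = -1.0, 2.0                  # endpoints of the chord
    xs = np.linspace(-2.0, 3.0, 200)

    lams = np.linspace(0.0, 1.0, 50)
    chord_x = (1 - lams) * x1 + lams * x2          # points between x1 and x2
    chord_y = (1 - lams) * f(x1) + lams * f(x2)    # straight segment joining (x1, f(x1)) and (x2, f(x2))

    plt.plot(xs, f(xs), label="f(x) = x^2")
    plt.plot(chord_x, chord_y, "r--", label="chord")
    plt.scatter([x1, x2], [f(x1), f(x2)], color="black")
    plt.legend()
    plt.show()   # between x1 and x2, the graph stays on or below the red chord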

Jensen’s Inequality: f(E[X]) ≤ E[f(X)]

Where f is the convex function, E is the expectation operator, and X is the random variable.

Jensen’s Inequality is a fundamental result in the study of convex functions. It extends the defining inequality of convexity from two points to arbitrary averages: for a convex function f, the function’s value at the mean of a set of points is less than or equal to the mean of the function’s values at those points.


The plot demonstrates Jensen’s Inequality: the red point f(E[X]) lies below the green dashed line at E[f(X)], illustrating that the function of the expected value is less than or equal to the expected value of the function.
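
The inequality is easy to check numerically as well. In the following minimal sketch, the convex function f(x) = x² and an exponentially distributed X are assumptions chosen for illustration, and both sides are estimated from the same sample:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.exponential(scale=1.0, size=100_000)   # samples of a random variable X
    f = lambda x: x**2                             # a convex function

    lhs = f(X.mean())      # f(E[X]), estimated via the sample mean
    rhs = f(X).mean()      # E[f(X)], estimated by averaging f over the samples
    print(lhs, rhs, lhs <= rhs)   # roughly 1.0 vs 2.0; the inequality holds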

Application to Machine Learning Theory

In machine learning, Jensen’s inequality is used to derive bounds on the performance of learning algorithms. It allows us to relate the expected risk to the empirical risk, which is a cornerstone of statistical learning theory.

Empirical Risk Minimization (ERM): In machine learning, we often aim to minimize the expected risk, which is a measure of how well our model performs on the entire data distribution. However, we don’t have access to the entire data distribution; we only have a sample. So instead, we minimize the empirical risk, which is the average loss over the training sample.

Let’s say our loss function is L (e.g., the square loss (h(x) − y)², which is a convex function), our hypothesis (the model that predicts an output, e.g., a neural net classifier) is h, and we have a training dataset of points {(xi, yi)}.

The empirical risk Remp(h): the average loss over a finite sample of data points from the training set. It is an estimate of the true risk based on the available data, and it is given by:

Remp(h) = (1/n) Σ L(h(xi), yi)

where the sum runs over the n training points (xi, yi).

The expected risk R(h): This is the expected loss over the entire distribution of all possible inputs and outputs, not just those we have sampled:

R(h) = E(x,y)∼D[L(h(x), y)]

Where D is the underlying data distribution.
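
In code, the two risks differ only in which sample the average loss is taken over. The following minimal sketch makes that explicit; the linear data-generating process, the noise level, and the fixed hypothesis h are assumptions chosen only for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_from_D(n):
        # Hypothetical data distribution D: y = 2x + Gaussian noise.
        x = rng.uniform(-1.0, 1.0, size=n)
        y = 2.0 * x + rng.normal(scale=0.3, size=n)
        return x, y

    h = lambda x: 1.8 * x                     # some fixed hypothesis (model)
    L = lambda y_pred, y: (y_pred - y) ** 2   # square loss, which is convex

    x_train, y_train = sample_from_D(50)      # the finite training sample
    R_emp = L(h(x_train), y_train).mean()     # empirical risk: average loss on the sample

    x_big, y_big = sample_from_D(1_000_000)   # large fresh sample standing in for D
    R_true = L(h(x_big), y_big).mean()        # Monte Carlo estimate of the expected risk R(h)

    print(f"R_emp(h) = {R_emp:.4f}, R(h) ≈ {R_true:.4f}")

Because h here is fixed in advance rather than fitted to the training sample, Remp(h) simply fluctuates around R(h); the systematic optimism discussed below appears once h is chosen by minimizing the empirical risk.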

Now, if our loss function L is convex, we can apply Jensen’s inequality to relate R(h) and Remp(h). Let’s assume that our training dataset is an i.i.d. sample from the distribution D. Then the empirical risk Remp(h) can be viewed as an approximation of the expected risk R(h).

If we apply Jensen’s Inequality with the convex loss L(z) = z², taking the random variable to be the prediction error h(x) − y, we get

(E[h(x)−y])² ≤ E[(h(x)−y)²] = R(h)

This tells us that the square of the expected prediction error is less than or equal to the expected squared prediction error, which is exactly the expected risk R(h). We cannot compute the left-hand side because we do not know the underlying data distribution, but we can use the sample, and hence the empirical risk, as an estimate in its place. Combining that estimate with the convexity behind Jensen’s Inequality, we infer that if we minimize the empirical risk with a convex loss function, we are also likely to be minimizing the expected risk.

(E[h(x)−y])² ≈ Remp(h) ≤ R(h)
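
The Jensen relationship at work here (a squared average never exceeds the average of the squares) can be checked on any sample of prediction errors. In this minimal self-contained sketch, the errors are simply drawn at random as stand-ins for h(xi) − yi:

    import numpy as np

    rng = np.random.default_rng(0)
    errors = rng.normal(loc=0.2, scale=1.0, size=500)   # stand-in prediction errors h(xi) - yi

    lhs = errors.mean() ** 2        # square of the average error, estimating (E[h(x) - y])^2
    rhs = (errors ** 2).mean()      # average squared error, playing the role of Remp(h)
    print(lhs, rhs, lhs <= rhs)     # Jensen: the squared mean never exceeds the mean square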

In summary, while the empirical risk is derived from the training data and may provide an optimistic estimate of model performance, the true risk encompasses all possible data, including unseen data, and is generally equal to or greater than the empirical risk. This discrepancy shows up as overfitting, where a model performs well on the training data (lower empirical error) but poorly on new data (higher expected error).

Mitigation strategy to improve model performance on unseen data: Regularization is a technique used to mitigate overfitting by adding a penalty to the loss function, which encourages the model to be less complex and more generalizable. Doing so helps to ensure that the model’s performance on the training data is more indicative of its performance on unseen data, thus bringing the empirical risk into closer alignment with the true risk and enhancing the model’s expected performance.
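
To make this concrete, here is a small self-contained sketch comparing training error and held-out error with and without a ridge (L2) penalty. The noisy sine data, the polynomial features, and the particular penalty values are illustrative assumptions, not something prescribed above:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n):
        # Hypothetical data: a noisy sine curve.
        x = rng.uniform(-1.0, 1.0, size=n)
        y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
        return x, y

    def poly_features(x, degree=12):
        return np.vander(x, degree + 1, increasing=True)   # 1, x, x^2, ..., x^degree

    def ridge_fit(X, y, lam):
        # Ridge regression posed as an augmented least-squares problem.
        d = X.shape[1]
        X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
        y_aug = np.concatenate([y, np.zeros(d)])
        return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

    x_tr, y_tr = sample(30)        # small training set, easy to overfit
    x_te, y_te = sample(10_000)    # large held-out set approximating the distribution
    X_tr, X_te = poly_features(x_tr), poly_features(x_te)

    for lam in [0.0, 1e-3, 1e-1]:  # lam = 0 means no regularization
        w = ridge_fit(X_tr, y_tr, lam)
        emp = ((X_tr @ w - y_tr) ** 2).mean()        # empirical risk (training error)
        held_out = ((X_te @ w - y_te) ** 2).mean()   # estimate of the expected risk
        print(f"lambda = {lam:6.3f}: R_emp = {emp:.3f}, R ≈ {held_out:.3f}")

Increasing the penalty typically raises the training error a little while narrowing its gap to the held-out error, which is exactly the closer alignment between empirical and true risk described above.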

Summary and Conclusion

In summary, Jensen’s Inequality provides a powerful lens through which we can view and understand the behavior of convex functions and sets, especially within the scope of machine learning. It underpins the relationship between the empirical risk and the expected risk, reminding us that the empirical risk is typically an optimistic lower estimate of the expected risk, and thereby guiding the development of learning algorithms towards better generalization. This perspective, combined with tools such as regularization, suggests that a model trained to minimize the empirical risk with a convex loss on a given dataset is likely to perform well on unseen data, making it a pivotal tool in the arsenal of machine learning practitioners.