EE227C course page
Last lecture, we saw the $\ell_1$-relaxation approach to solving sparse linear systems. Now we will experimentally explore how fast we can solve the corresponding optimization problems. First, we treat the $\ell_1$-relaxation as a standard convex problem. To speed things up, we then turn to a first-order approach (projected gradient descent) and find that it works even when we project directly onto the non-convex set of sparse vectors.
%matplotlib notebook
import math
from timeit import default_timer as timer
import cvxpy # convex optimization library
import matplotlib.pyplot as plt
import numpy as np
from plotters import kwargs, setup_layout
setup_layout()
Let's set up a simple test problem of the form $y = A x$ with $s$-sparse $x$. As usual, the matrix $A$ has dimension $n \times d$. We sample the entries of $A$ as i.i.d. Gaussians so the matrix satisfies the restricted isometry property (RIP) if we take enough samples $n$.
d = 100
n = 50
s = 10
A = np.random.randn(n, d)
A /= math.sqrt(n) # Normalization so the matrix is RIP (optional)
support = np.random.choice(d, s, replace=False)
x_star = np.zeros(d)
x_star[support] = np.random.randn(s)
y = np.dot(A, x_star)
We write $\ell_1$-minimization as a convex optimization problem using CVXPY. That way we don't have to worry about the algorithmic details for now (CVXPY uses an interior point method internally).
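Concretely, the program the code below solves is

$$\min_{x \in \mathbb{R}^d} \; \|x\|_1 \quad \text{subject to} \quad A x = y.$$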
def l1min(y, A):
    # Solve min ||x||_1 subject to A x = y.
    x_hat = cvxpy.Variable(A.shape[1])
    objective = cvxpy.Minimize(cvxpy.norm(x_hat, 1))
    constraints = [A * x_hat == y]
    prob = cvxpy.Problem(objective, constraints)
    prob.solve()
    return np.squeeze(np.array(x_hat.value)), (prob.value, prob.status)
Let's see if we get an accurate solution, e.g., in terms of relative error.
x_hat, _ = l1min(y, A)
def relative_error(x_hat, x_star):
    return np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star)
relative_error(x_hat, x_star)
Looks good!
Next, let's check how many samples we need empirically for the $\ell_1$-relaxation to work.
For this, we repeat the recovery experiment above for an increasing number of samples from $n = 10$ to $n = 100$ (when the system becomes fully determined).
For each value of $n$, we run 10 trials and count how many of them succeed (relative recovery error less than $10^{-6}$).
d = 100
n_vals = np.linspace(10, 100, 20, dtype=np.int64)
s = 10
num_trials = 10
num_correct = np.zeros(len(n_vals))
for ii, n in enumerate(n_vals):
    for jj in range(num_trials):
        A = np.random.randn(n, d)
        A /= math.sqrt(n)
        support = np.random.choice(d, s, replace=False)
        x_star = np.zeros(d)
        x_star[support] = np.random.randn(s)
        y = np.dot(A, x_star)
        x_hat, _ = l1min(y, A)
        if relative_error(x_hat, x_star) < 1e-6:
            num_correct[ii] += 1
num_correct /= num_trials
Let's plot the empirical recovery probability as a function of $n$.
plt.figure()
plt.plot(n_vals, num_correct)
plt.xlabel('n')
plt.ylabel('Fraction correct')
plt.show()
Recall that an i.i.d. Gaussian matrix requires $O(s \log(d/s))$ rows to satisfy the RIP (this corresponds to $n = O(s \log(d/s))$ samples here). So the recovery phase transition occurs roughly where we'd expect (at a small multiple of $s = 10$).
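To get a rough sense of scale (an added back-of-the-envelope computation, not part of the original notebook), we can evaluate $s \log(d/s)$ for our parameters; the constant hidden in the $O(\cdot)$ is unspecified, so this only indicates the order of magnitude.

# Added sanity check: s * log(d / s) with the natural logarithm.
# The constant in the O(.) bound is unspecified, so treat this as a rough scale only.
print(s * math.log(d / s))  # roughly 23 for s = 10, d = 100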
Next, let's see how fast CVXPY solves our problem.
d = 4000
n = 2000
s = 10
A = np.random.randn(n, d)
A /= math.sqrt(n)
support = np.random.choice(d, s, replace=False)
x_star = np.zeros(d)
x_star[support] = np.random.randn(s)
y = np.dot(A, x_star)
start = timer()
x_hat, _ = l1min(y, A)
end = timer()
print('Relative error {}, took {} seconds'.format(relative_error(x_hat, x_star), end - start))
This is starting to look somewhat slow.
As we will see later in the course, this is to be expected if we use interior point methods without exploiting further problem structure. While interior point methods reliably give us high-accuracy solutions, they can also be prohibitively slow for large-scale problems (in MRI reconstruction, we might easily have $d = 10^6$).
So if we want to solve very large instances, we should also consider first-order methods. In fact, there is an interesting phenomenon here: instead of solving the convex relaxation with a first-order method, let's see what happens if we run projected gradient descent directly on the non-convex set of sparse vectors.
Projected gradient descent for sparse vectors is also known as Iterative Hard Thresholding (IHT) because the projection step (find the closest $s$-sparse vector) corresponds to hard thresholding the vector. In particular, we keep only the $s$ largest entries (in absolute value) and set the rest to 0.
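Writing $H_s$ for this projection (hard thresholding), the update implemented below is

$$x^{t+1} = H_s\big(x^t - \eta \, A^\top (A x^t - y)\big),$$

where $\eta$ is the step size and the first iterate is $x^1 = H_s(A^\top y)$.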
def hard_thresholding(x, s):
    # Project x onto the set of s-sparse vectors.
    result = np.zeros_like(x)
    # Indices of the s entries with largest absolute value.
    large_indices = np.argpartition(np.abs(x), x.size - s)[x.size - s:]
    result[large_indices] = x[large_indices]
    return result
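# Added illustrative check (not part of the original notebook): keep the
# s = 2 largest-magnitude entries of a small vector and zero out the rest.
print(hard_thresholding(np.array([0.5, -3.0, 0.1, 2.0]), 2))  # [ 0. -3.  0.  2.]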
def IHT(y, A, s, step_size=1.0, max_iter=100):
    # Iterative hard thresholding for y = A x with s-sparse x.
    # Initialize with the hard-thresholded vector A^T y.
    x_hat = hard_thresholding(np.dot(A.T, y), s)
    iterates = [x_hat]
    for num_iter in range(1, max_iter):
        # Gradient of the least-squares objective 0.5 * ||A x - y||_2^2.
        step = np.dot(A.T, np.dot(A, x_hat) - y)
        x_hat = hard_thresholding(x_hat - step_size * step, s)
        iterates.append(x_hat)
    return x_hat, np.array(iterates)
Let's see how much faster the first-order approach is.
start = timer()
x_hat, iterates = IHT(y, A, s, max_iter=10)
end = timer()
print('Relative error {}, took {} seconds'.format(relative_error(x_hat, x_star), end - start))
More than $1000\times$ faster. Nice! But is IHT guaranteed to work? After all, the set of $s$-sparse vectors is not convex.
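Before turning to that question, the iterates returned by IHT give a quick added empirical look (not part of the original notebook) at how fast the error decays:

# Added diagnostic: relative error of each IHT iterate on a log scale.
errors = [relative_error(x, x_star) for x in iterates]
plt.figure()
plt.semilogy(range(1, len(errors) + 1), errors)
plt.xlabel('Iteration')
plt.ylabel('Relative error')
plt.show()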
We will prove on the board next that it does work, under a suitable RIP assumption.