Abstract
Sketching has emerged as a powerful technique for speeding up problems in numerical linear algebra, such as regression. In the overconstrained regression problem, one is given an n × d matrix A, with n ≫ d, as well as an n × 1 vector b, and one wants to find a vector x so as to minimize the residual error ||Ax - b||_2. Using the sketch-and-solve paradigm, one first computes S · A and S · b for a randomly chosen matrix S, then outputs x′ = (SA)†Sb so as to minimize ||SAx′ - Sb||_2. The sketch-and-solve paradigm gives a bound on ||x′ - x∗||_2 when A is well-conditioned, where x∗ denotes the minimizer of ||Ax - b||_2.

Our main result is that, when S is the subsampled randomized Fourier/Hadamard transform, the error x′ - x∗ behaves as if it lies in a "random" direction within this bound: for any fixed direction a ∈ ℝ^d, we have with probability 1 - d^{-c} that

    ⟨a, x′ - x∗⟩ ≲ ||a||_2 ||x′ - x∗||_2 / d^{1/2-γ},    (1)

where c, γ > 0 are arbitrary constants. This implies that ||x′ - x∗||_∞ is a factor d^{1/2-γ} smaller than ||x′ - x∗||_2. It also gives a better bound on the generalization of x′ to new examples: if the rows of A correspond to examples and the columns to features, then our result gives a better bound on the error introduced by sketch-and-solve when classifying fresh examples.

We show that not all oblivious subspace embeddings S satisfy these properties. In particular, we give counterexamples showing that matrices based on Count-Sketch or leverage score sampling do not satisfy them.

We also provide lower bounds, both on how small ||x′ - x∗||_2 can be and for our new guarantee (1), showing that the subsampled randomized Fourier/Hadamard transform is nearly optimal. Our lower bound on ||x′ - x∗||_2 shows that there is an O(1/ε) separation in the dimension of the optimal oblivious subspace embedding required for outputting an x′ for which ||x′ - x∗||_2 ≤ ε ||Ax∗ - b||_2 · ||A†||_2, compared to the dimension of the optimal oblivious subspace embedding required for outputting an x′ for which ||Ax′ - b||_2 ≤ (1 + ε) ||Ax∗ - b||_2. That is, the former problem requires dimension Ω(d/ε^2) while the latter can be solved with dimension O(d/ε). This explains why known upper bounds on the dimensions of these two variants of regression have differed in prior work.
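To make the sketch-and-solve paradigm above concrete, the following is a minimal NumPy/SciPy sketch of overconstrained regression with a subsampled randomized Hadamard transform (SRHT). It is an illustration under our own assumptions (the synthetic instance and the dimensions n, d, m are arbitrary choices), not the paper's code or experiments; under guarantee (1) one expects ||x′ - x∗||_∞ to be much closer to ||x′ - x∗||_2 / d^{1/2} than to ||x′ - x∗||_2.

    # Illustrative sketch-and-solve least squares with an SRHT sketch.
    # The synthetic instance and dimensions are arbitrary choices for this demo.
    import numpy as np
    from scipy.linalg import hadamard

    rng = np.random.default_rng(0)
    n, d, m = 1024, 20, 200        # n >> d; m is the sketch dimension (n a power of 2)

    # Overconstrained instance: A is n x d, b = A x_true + noise.
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

    # SRHT: S = sqrt(n/m) * P H D, with D a random +-1 diagonal, H the
    # normalized Hadamard transform, and P a uniform sample of m rows.
    D = rng.choice([-1.0, 1.0], size=n)
    H = hadamard(n) / np.sqrt(n)
    rows = rng.choice(n, size=m, replace=False)
    SA = np.sqrt(n / m) * (H[rows] @ (D[:, None] * A))
    Sb = np.sqrt(n / m) * (H[rows] @ (D * b))

    # Sketch-and-solve output x' = (SA)^+ Sb versus the exact minimizer x*.
    x_prime = np.linalg.lstsq(SA, Sb, rcond=None)[0]
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]

    err = x_prime - x_star
    print("||x' - x*||_2         =", np.linalg.norm(err))
    print("||x' - x*||_inf       =", np.linalg.norm(err, np.inf))
    print("||x' - x*||_2/sqrt(d) =", np.linalg.norm(err) / np.sqrt(d))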