In practice, we cannot use $h_{\text{Bayes}}$ because we typically do not know $\eta(x)$ or $D$; moreover, even if we knew $D$, computing $h_{\text{Bayes}}$ might be computationally infeasible.
Working assumption:
Let's now suppose we have a training set $S=\lbrace (x_{i},y_{i})\rbrace_{i=1}^{m} \sim D^{m}$.
Definition
The empirical error (or training error) of $h$ on $S$ is $$\hat{L}_ {S}(h) = \frac{1}{m} \sum_ {i=1} ^{m} \mathbb{1}(h(x_ {i}) \neq y_ {i}).$$
Note that $L(h) = \mathbb{E}_ {S \thicksim D^m} \hat{L}_{S}(h)$.
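This identity follows from linearity of expectation: each pair $(x_{i}, y_{i})$ is an independent draw from $D$, and recalling that $L(h) = \Pr_ {(x,y) \thicksim D}[h(x) \neq y]$,
$$\mathbb{E}_ {S \thicksim D^m} \hat{L}_ {S}(h) = \frac{1}{m} \sum_ {i=1} ^{m} \Pr_ {(x_ {i}, y_ {i}) \thicksim D}\big[h(x_ {i}) \neq y_ {i}\big] = L(h).$$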
Ultimately, we want to bound $L(h) - L_{\text{Bayes}}$.
No Free Lunch Theorem & Implications
No Free Lunch Theorem: For any algorithm $A$ mapping training samples to hypotheses, any $m\in \mathbb{N}$, and any $\epsilon > 0$, there exists a distribution $D$ (over a sufficiently large domain) such that $L_{\text{Bayes}} = 0$ and $\mathbb{E} _{S \thicksim D^m} L(A(S)) \geq \frac{1}{2} - \epsilon$. Note that the bound is on the true error of $A(S)$; the training error can always be made zero by memorizing $S$.
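The following is a small simulation sketch (not part of the original argument) that conveys the intuition: when the domain is much larger than $m$, a learner sees only a vanishing fraction of the points, so an adversarial labeling forces its expected true error toward $\frac{1}{2}$. The memorization learner, domain size, and target labeling below are hypothetical choices for illustration only.

```python
import random

random.seed(0)

N = 10_000   # domain size |X|, much larger than the sample size m
m = 100      # number of training examples
# A fixed labeling f : X -> {-1, 1}; taking y = f(x) makes L_Bayes = 0.
f = [random.choice([-1, 1]) for _ in range(N)]

def memorize(S):
    """A(S): repeat the stored label on points seen in S, predict +1 elsewhere."""
    table = dict(S)
    return lambda x: table.get(x, 1)

def true_error(h):
    """L(h) under the uniform distribution on {0, ..., N-1} with labels f(x)."""
    return sum(h(x) != f[x] for x in range(N)) / N

# Monte Carlo estimate of E_{S ~ D^m}[ L(A(S)) ].
errors = []
for _ in range(200):
    xs = [random.randrange(N) for _ in range(m)]
    S = [(x, f[x]) for x in xs]
    errors.append(true_error(memorize(S)))

# The learner is correct on the few memorized points and wrong about half the
# time elsewhere, so the estimate comes out near (1/2) * (N - m) / N, i.e. close to 1/2.
print(sum(errors) / len(errors))
```

Of course this only shows one learner against one particular distribution; the theorem asserts that for every learner some such distribution exists.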
Let $\mathcal{H}$ be a hypothesis class of functions $h : \mathcal{X} \to \lbrace -1, 1 \rbrace$. For example, $\mathcal{H}$ might be the set of all neural networks, linear classifiers, or polynomials.
Let $h^* = \underset{h \in \mathcal{H}}{\arg \min}\, L(h)$. Then, for any $h \in \mathcal{H}$,
$$L(h) - L_{\text{Bayes}} = \underbrace{L(h) - L(h^*)}_ {\text{estimation error}} + \underbrace{L(h^*) - L_{\text{Bayes}}}_ {\text{approximation error}},$$
where the estimation error measures how close $h$ is to the best hypothesis $h^{*} \in \mathcal{H}$, and the approximation error measures how expressive the class $\mathcal{H}$ is. This decomposition is illustrated below (green and orange denote the estimation and approximation errors, respectively).
Increasing the size of $\mathcal{H}$ increases the estimation error and decreases the approximation error.
Bias-Variance Tradeoff
We want to pick the hypothesis class $\mathcal{H}$ that minimizes the total error. The relationship between the estimation, approximation, and total errors is illustrated below. For intuition, see Occam's Razor; see also the Bias-Variance Tradeoff.
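As a rough numerical sketch of this tradeoff (not from the original notes, and using least-squares regression in place of classification purely for convenience), one can fit polynomial classes $\mathcal{H}_d$ of increasing degree $d$ to noisy data: the training error keeps decreasing as $d$ grows, while the test error typically first decreases (the approximation error shrinks) and then increases again (the estimation error grows). The data-generating function and sample sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n noisy samples from a fixed nonlinear target (hypothetical choice)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(30)     # small training set
x_test, y_test = sample(2000)     # large test set approximates the true error

for d in [1, 2, 3, 5, 8, 12]:
    # H_d = polynomials of degree <= d; np.polyfit returns the least-squares fit.
    coeffs = np.polyfit(x_train, y_train, deg=d)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {d:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Choosing $d$ by the test (or validation) error is one practical way of picking $\mathcal{H}$ to balance the two sources of error.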