Descriptors

This section breaks down the SIFT descriptor that Sara implements.

In the sequel, let us denote an oriented scale-invariant local keypoint by \(k = (x, y, \sigma, \theta)\), where \((x, y)\) is its center, \(\sigma\) its scale and \(\theta\) its dominant gradient orientation.

SIFT

Interpretation

SIFT is a feature vector that encodes the photometric information of an image patch into histograms of gradients.

The image patch is divided into a 2D grid of \(N \times N\) square image patches, from each of which we calculate a histogram of gradients. Each histogram bins gradient orientations into \(O\) principal orientations. A single SIFT descriptor can thus be viewed as a 3D tensor \(\mathbf{h} \in \mathbb{R}^{N \times N \times O}\).

The component \(\mathbf{h}[i, j, o]\) quantifies the frequency of orientation \(\frac{2 o \pi}{O}\ \text{rad}\) in the \((i, j)\)-th histogram, where \({0 \leq i,j < N}\) and \(0 \leq o < O\).

Note

In the original implementation [Lowe:2004:ijcv] the parameters are set as \(N = 4\) and \(O = 8\), so that a SIFT descriptor is a feature vector \(\mathbf{h} \in \mathbb{R}^{128}\).
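For instance, the bin \(\mathbf{h}[i, j, o]\) maps to a flat index of the 128-dimensional feature vector with the usual row-major formula; a minimal sketch (this index helper mirrors the one used in the implementation at the end of this section):

constexpr auto N = 4;
constexpr auto O = 8;

// Row-major flattening of the 3D bin index (i, j, o) into [0, 128).
constexpr auto at(int i, int j, int o) -> int
{
  return i * N * O + j * O + o;
}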

Important

A peculiarity of SIFT is that the image patches forming the grid overlap, and that the orientation bins overlap as well, because of the trilinear interpolation trick, as we will see later.

Design Details

The design of SIFT is geared towards robustness w.r.t. noisy estimation of keypoint position, scale and orientation, lossy and noisy image acquisition, illumination changes, etc.

SIFT can encode the photometric information of the region around the keypoint \(k\) in a similarity-invariant manner if we make use of the scale \(\sigma\) and the dominant gradient orientation \(\theta\).

By similarity invariance, we mean that

  • if we take multiple photographs \(I_i\) of the same salient region, e.g., a corner or blob, and
  • if we have a perfectly accurate feature detector and orientation estimator

then we would detect that salient region in each image as a keypoint \(k_i\) with perfect pixel localization, scale and orientation.

Normalizing the image patch around this salient region with the scale \(\sigma_i\) and the orientation \(\theta_i\) would yield the same image. The SIFT descriptors \(\mathbf{h}_i\) simply compress the relevant image information in each normalized patch around the keypoint \(k_i\) and are thus expected to be identical.

This leads us to define the local coordinate system and its associated normalizing transform.

Local Coordinate System and Normalizing Transform

The orientations are calculated and accumulated with respect to the local coordinate system \(\mathcal{C}_k\) associated with the keypoint \(k\). This local coordinate system is equivalently characterized by a similarity transform \(\mathbf{T}\), which we define as follows.

Important

Let \(\lambda_\text{zoom}\) be a magnification factor. The scale of the normalizing transform is \(s = \lambda_\text{zoom} \sigma\).

Thus in matrix notation and using homogeneous coordinates, the normalizing transform \(\mathbf{T}\) transforms image coordinates \(\mathbf{u} = (u, v)\) to normalized coordinates \(\tilde{\mathbf{u}} = (\tilde{u}, \tilde{v})\) as follows

\[\begin{bmatrix} \tilde{\mathbf{u}} \\ 1 \end{bmatrix} = \mathbf{T} \begin{bmatrix} \mathbf{u} \\ 1 \end{bmatrix} = \displaystyle \frac{1}{s} \left[ \begin{array}{c|c} \mathbf{I}_2 & \mathbf{0}_2 \\[0.1em] \hline \\[-1em] \mathbf{0}_2^T & s \end{array} \right] \left[ \begin{array}{c|c} \mathbf{R}_{\theta}^T & -\mathbf{R}_{\theta}^T \mathbf{x} \\[0.1em] \hline \\[-1em] \mathbf{0}_2^T & 1 \end{array} \right] \begin{bmatrix} \mathbf{u} \\ 1 \end{bmatrix}\]
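Multiplying out, \(\tilde{\mathbf{u}} = \frac{1}{s} \mathbf{R}_\theta^T (\mathbf{u} - \mathbf{x})\). As an illustration, \(\mathbf{T}\) can be assembled with Eigen as follows; this is a minimal sketch, and the helper name normalizing_transform is ours:

#include <Eigen/Dense>

// Assemble the 3x3 homogeneous normalizing transform T of the keypoint
// k = (x, y, sigma, theta), i.e., u_tilde = (1/s) R_theta^T (u - x).
auto normalizing_transform(float x, float y, float sigma, float theta)
    -> Eigen::Matrix3f
{
  constexpr auto lambda_zoom = 3.f;
  const auto s = lambda_zoom * sigma;

  // The inverse rotation R_theta^T.
  const Eigen::Matrix2f R_inv =
      Eigen::Rotation2D<float>{-theta}.toRotationMatrix();

  auto T = Eigen::Matrix3f{Eigen::Matrix3f::Identity()};
  T.topLeftCorner<2, 2>() = R_inv / s;
  T.col(2).head<2>() = -R_inv * Eigen::Vector2f(x, y) / s;
  return T;
}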

Grid of Overlapping Patches and Overlapping Histogram Bins

To compute a SIFT descriptor we consider an oriented square image patch \(\mathcal{P}\):

  • centered at \(\mathbf{x} = (x, y)\), and
  • oriented with an angle \(\theta\) w.r.t. the image axes to enforce invariance to rotation.

Note

We divide the square patch \(\mathcal{P}\) into a grid of \(N \times N\) overlapping square patches \(\mathcal{P}_{ij}\) for \({0 \leq i,j < N},\) each of them having a radius of \(s = \lambda_{\text{zoom}} \sigma\) pixels. This means that the whole square patch \(\mathcal{P}\) has a side length equal to \((N + 1) s\).

In the original implementation of [Lowe:2004:ijcv] the magnification factor \(\lambda_{\text{zoom}}\) is set to \(3\).

Before detailing the geometry of the patches \(\mathcal{P}_{ij}\) in the next section, let us explain why the overlap is a consequence of the trilinear interpolation.

Quoting [Lowe:2004:ijcv], the trilinear interpolation is key to avoiding the following: “It is important to avoid all boundary affects in which the descriptor abruptly changes as a sample shifts smoothly from being within one histogram to another or from one orientation to another.”

First, trilinear interpolation compensates for the noisy estimation of the keypoint \(k\). In particular, a pixel in the patch \(\mathcal{P}\) can belong to up to \(4\) adjacent patches \(\mathcal{P}_{i'j'}\), for \(i \leq i' \leq i + 1\) and \(j \leq j' \leq j + 1\), which compensates for the noisy scale \(\sigma\) and location \(\mathbf{x}\). In other words, the patches overlap.

Then we encode the photometric information of each patch \(\mathcal{P}_{ij}\) into a histogram \(\mathbf{h}_{ij} \in \mathbb{R}^O\) of gradient orientations, where the orientations are binned into \(O\) principal orientations. Again, trilinear interpolation compensates for the noisy orientation \(\theta\), and as a result:

Note

The histogram bins \(\mathbf{h}[i, j, o]\) overlap as each of them covers the interval of orientations \([\frac{2 \pi (o - 1)}{O}, \frac{2 \pi (o + 1)}{O}]\).

Geometry of Overlapping Patches

In this local coordinate system \(\mathcal{C}_k\):

  • the keypoint center is at the origin \((0, 0)\),

  • each patch \(\mathcal{P}_{ij}\) can be viewed as a square with side length \(2\),

  • each patch has:

    • its top-left corner at

      \[\mathbf{c}_{ij}^{\text{tl}} = [j, i] - \frac{N + 1}{2} [1, 1]\]
    • its bottom-right corner at

      \[\mathbf{c}_{ij}^{\text{br}} = [j, i] - \frac{N - 3}{2} [1, 1]\]
    • its center at

      \[\mathbf{c}_{ij} = [j, i] - \frac{N - 1}{2} [1, 1]\]

Note

Clearly a patch \(\mathcal{P}_{ij}\) is a closed ball centered at \(\mathbf{c}_{ij}\) and with radius \(1\) for the \(\ell_\infty\) norm in this coordinate system.

The patches overlap by construction because the centers are laid on a 2D grid with step size \(1\). The overlapping helps to make the SIFT descriptor more robust to the noisy estimation of the keypoint position, scale and orientation.

The patch centers \(\mathbf{c}_{ij}\) coincide with the histogram indices \((i, j)\) if we shift each coordinate by \(\frac{N - 1}{2}\).

This observation will be useful in the SIFT implementation to determine which histogram bins need to be accumulated.

Thus, for \(N = 4\), the centers are

\[\left[ \begin{array}{c|c|c|c} (-1.5,-1.5) & (-0.5,-1.5) & (+0.5,-1.5) & (+1.5,-1.5) \\ \hline (-1.5,-0.5) & (-0.5,-0.5) & (+0.5,-0.5) & (+1.5,-0.5) \\ \hline (-1.5,+0.5) & (-0.5,+0.5) & (+0.5,+0.5) & (+1.5,+0.5) \\ \hline (-1.5,+1.5) & (-0.5,+1.5) & (+0.5,+1.5) & (+1.5,+1.5) \\ \end{array} \right]\\ \]

Equivalently, with array broadcasting (cf. the NumPy concept):

\[\left[ \begin{array}{c|c|c|c} [0, 0] & [1, 0] & [2, 0] & [3, 0] \\ \hline [0, 1] & [1, 1] & [2, 1] & [3, 1] \\ \hline [0, 2] & [1, 2] & [2, 2] & [3, 2] \\ \hline [0, 3] & [1, 3] & [2, 3] & [3, 3] \end{array} \right] - [1.5, 1.5] \]
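The table above can be reproduced programmatically; a small sketch with \(N = 4\):

#include <iostream>

// Print the patch centers c_ij = [j, i] - (N - 1) / 2 * [1, 1].
auto print_patch_centers() -> void
{
  constexpr auto N = 4;
  for (auto i = 0; i < N; ++i)
  {
    for (auto j = 0; j < N; ++j)
      std::cout << "(" << j - (N - 1) / 2.f << ", "
                << i - (N - 1) / 2.f << ") ";
    std::cout << "\n";
  }
}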

SIFT Coordinate System

Let us consider a pixel \((u, v)\) in the patch \(\mathcal{P}\), where \(r = \frac{N + 1}{2} s\) denotes the patch radius (assuming \(\theta = 0\) for simplicity):

\[\left\{ \begin{aligned} & \displaystyle x - r \leq u \leq x + r \\ & \displaystyle y - r \leq v \leq y + r \\ \end{aligned} \right. \Longleftrightarrow \left\{ \begin{aligned} & \displaystyle -\frac{N + 1}{2} \leq \tilde{u} \leq \frac{N + 1}{2} \\ & \displaystyle -\frac{N + 1}{2} \leq \tilde{v} \leq \frac{N + 1}{2} \end{aligned} \right.\]

Introducing shifted coordinates, which we choose to call SIFT coordinates, \(\hat{\mathbf{u}} = (\hat{u}, \hat{v})\)

\[\left\{ \begin{aligned} \hat{u} &= \tilde{u} + \frac{N - 1}{2} \\ \hat{v} &= \tilde{v} + \frac{N - 1}{2} \end{aligned} \right. \]

we see that, equivalently,

\[\left\{ \begin{aligned} -1 \leq \hat{u} \leq N \\ -1 \leq \hat{v} \leq N \end{aligned} \right. \]

Note

In matrix notation and using homogeneous coordinates, the normalizing transform \(\mathbf{T}_\text{SIFT}\) transforms image coordinates \(\mathbf{u}\) to SIFT coordinates \(\hat{\mathbf{u}}\) as follows

\[\begin{bmatrix} \hat{\mathbf{u}} \\ 1 \end{bmatrix} = \mathbf{T}_{\text{SIFT}} \begin{bmatrix} \mathbf{u} \\ 1 \end{bmatrix} = \displaystyle \underbrace {\left[ \begin{array}{ccc} 1 & 0 & \frac{N - 1}{2} \\[0.1em] 0 & 1 & \frac{N - 1}{2} \\[0.1em] 0 & 0 & 1 \\[0.1em] \end{array} \right]}_{\text{shift}} \mathbf{T} \begin{bmatrix} \mathbf{u} \\ 1 \end{bmatrix}\]
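In code, \(\mathbf{T}_\text{SIFT}\) simply composes this shift with \(\mathbf{T}\); a sketch reusing the hypothetical normalizing_transform helper from earlier:

auto sift_transform(float x, float y, float sigma, float theta)
    -> Eigen::Matrix3f
{
  constexpr auto N = 4;

  // The shift that maps normalized coordinates to SIFT coordinates.
  auto shift = Eigen::Matrix3f{Eigen::Matrix3f::Identity()};
  shift.col(2).head<2>().setConstant((N - 1) / 2.f);

  return shift * normalizing_transform(x, y, sigma, theta);
}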

The floored coordinates satisfy:

\[\left\{ \begin{aligned} -1 \leq \lfloor \hat{u} \rfloor \leq N \\ -1 \leq \lfloor \hat{v} \rfloor \leq N \end{aligned} \right.\]
\[\left\{ \begin{aligned} \lfloor \hat{u} \rfloor \leq \hat{u} < \lfloor \hat{u} \rfloor + 1 \\ \lfloor \hat{v} \rfloor \leq \hat{v} < \lfloor \hat{v} \rfloor + 1 \end{aligned} \right.\]

The pixel \((u, v)\) belongs to up to \(4\) patches:

  • \(\mathcal{P}_{ \lfloor \hat{v} \rfloor , \lfloor \hat{u} \rfloor }\)
  • \(\mathcal{P}_{ \lfloor \hat{v} \rfloor , \lfloor \hat{u} \rfloor + 1}\)
  • \(\mathcal{P}_{ \lfloor \hat{v} \rfloor + 1, \lfloor \hat{u} \rfloor }\)
  • \(\mathcal{P}_{ \lfloor \hat{v} \rfloor + 1, \lfloor \hat{u} \rfloor + 1}\)

We say “up to \(4\)” because for example a gradient at the boundary \((-1,-1)\) contributes only to \(\mathcal{P}_{00}\).
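The enumeration of these (up to) 4 patches can be sketched as follows (the helper name is ours). Note the use of std::floor: a plain int cast truncates toward zero, which is wrong for the negative SIFT coordinates in \([-1, 0)\):

#include <cmath>
#include <utility>
#include <vector>

// Enumerate the histogram indices (i, j) of the patches P_ij that contain
// the pixel with SIFT coordinates (u_hat, v_hat).
auto patches_containing(float u_hat, float v_hat)
    -> std::vector<std::pair<int, int>>
{
  constexpr auto N = 4;

  const auto j0 = static_cast<int>(std::floor(u_hat));
  const auto i0 = static_cast<int>(std::floor(v_hat));

  auto indices = std::vector<std::pair<int, int>>{};
  for (auto di = 0; di < 2; ++di)
    for (auto dj = 0; dj < 2; ++dj)
      if (0 <= i0 + di && i0 + di < N && 0 <= j0 + dj && j0 + dj < N)
        indices.emplace_back(i0 + di, j0 + dj);
  return indices;
}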

Histogram of Gradients

Consider a pixel \(\mathbf{u} \in \mathcal{P}_{ij}\). Its contribution to the histogram \(\mathbf{h}_{ij}\) is

\[\boxed{ w(\mathbf{u}) = \underbrace{ \exp \left( - \frac{\| \tilde{\mathbf{u}} \|^2}{2 (N/2)^2} \right) }_{\text{distance to center}} \underbrace{ \| \nabla g_\sigma * I(\mathbf{u}) \|_2 }_{\text{gradient magnitude}} }\]

[Lowe:2004:ijcv] chooses to give more emphasis to gradients close to the keypoint center \(\mathbf{x}\) to compensate for the noisy estimation of the keypoint.

With the trilinear interpolation, its contribution to \(\mathbf{h}_{ij}\) becomes:

\[\displaystyle w_{\text{final}}(\mathbf{u}) = w(\mathbf{u}) \left( 1 - |\hat{u} - j| \right) \left( 1 - |\hat{v} - i| \right) \]

In the local coordinate system, the orientation of the gradient \(\nabla I_\sigma(\mathbf{u})\), where \(I_\sigma = g_\sigma * I\), is calculated as:

\[\phi = \text{atan2}\left( \partial_y I_\sigma(\mathbf{u}),\, \partial_x I_\sigma(\mathbf{u}) \right) - \theta\]

The normalized orientation, after wrapping \(\phi\) to \([0, 2\pi)\), is:

\[\hat{\phi} = \phi \frac{O}{2 \pi} \]

It falls within two orientation bins:

  • \(\mathbf{h}[i, j, o]\)
  • \(\mathbf{h}[i, j, (o + 1) \bmod O]\)

where \(o = \lfloor \hat{\phi} \rfloor\). The modulo accounts for the circular wrap-around of orientations.

The contribution will be distributed to the two bins as follows

\[\displaystyle \mathbf{h}[i, j, o] \leftarrow \mathbf{h}[i, j, o] + w_\text{final}(\mathbf{u}) \left(1 - (\hat{\phi} - o) \right) \\ \displaystyle \mathbf{h}[i, j, (o + 1) \bmod O] \leftarrow \mathbf{h}[i, j, (o + 1) \bmod O] + w_\text{final}(\mathbf{u}) \left( \hat{\phi} - o \right)\]

Note

Each histogram bin has an explicit formula, but it is not efficient to calculate it as is:

\[\displaystyle \mathbf{h}[i, j, o] = \sum_{\mathbf{u} \in \mathcal{P}_{ij}} w(\mathbf{u}) \left( 1 - | \hat{u} - j | \right) \left( 1 - | \hat{v} - i | \right) \left( 1 - | \hat{\phi} - o | \right) \mathbf{1}_{\left|\ \hat{\phi} - o \right| < 1} \]

Sketch of Implementation

The last paragraph gives enough insight as to how to compute the SIFT descriptor. It is easy to show that we need to scan all the pixels in a large enough image patch, e.g., of radius

\[\boxed{r = \sqrt{2} \frac{N + 1}{2} \lambda_{\text{zoom}} \sigma}\]

In the above formula, notice that:

  • the factor \(\sqrt{2}\): because the square patches are oriented with an angle \(\theta \neq 0\), we need to make sure we do not miss any pixels, in particular at the corners of the patches; granted, we then consider pixels that are outside the patch domain, possibly many of them, and these need to be discarded.
  • if the orientation \(\theta\) were zero, we could check that a radius \(r = \frac{N + 1}{2} \lambda_{\text{zoom}} \sigma\) would be sufficient.
  • the factor \(\frac{N + 1}{2}\): this accounts for the gradients in patches “at the border” of the image patch \(\mathcal{P}\). These gradients “at the border” may belong to only one histogram (“at the corners”) or two histograms (“at the edges”).

The SIFT descriptor for keypoint \(k\) is calculated as follows:

Important

  • For each pixel \(\mathbf{u} = (u, v) \in [x-r,x+r] \times [y-r, y+r]\):
    • calculate the gradient \(\nabla g_\sigma * I (\mathbf{u})\)
    • express its orientation w.r.t. the local coordinate system \(\mathcal{C}_k\)
    • calculate the contribution \(w(\mathbf{u})\) of the gradient.
    • accumulate the histograms using trilinear interpolation (detailed below)

Note

In practice, for efficiency, the gradients are precomputed only once, in polar coordinates, at every scale of the Gaussian pyramid.
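A sketch of this precomputation for one scale \(I_\sigma\) of the pyramid, assuming the Image class exposes width(), height() and operator()(x, y) as in the code below; the helper name compute_polar_gradients is ours. We store the orientation in \([0, 2\pi)\) so that a single wrap-around correction suffices later:

#include <cmath>

// Fill grad_polar_coords with (magnitude, orientation) at each pixel of the
// Gaussian-smoothed image I_sigma, using central finite differences.
// Border pixels are left untouched for simplicity.
auto compute_polar_gradients(const Image<float, 2>& I_sigma,
                             Image<Vector2f, 2>& grad_polar_coords) -> void
{
  constexpr auto two_pi = 6.2831853f;
  for (auto y = 1; y < I_sigma.height() - 1; ++y)
  {
    for (auto x = 1; x < I_sigma.width() - 1; ++x)
    {
      const auto gx = 0.5f * (I_sigma(x + 1, y) - I_sigma(x - 1, y));
      const auto gy = 0.5f * (I_sigma(x, y + 1) - I_sigma(x, y - 1));
      auto ori = std::atan2(gy, gx);  // In (-pi, pi].
      if (ori < 0)
        ori += two_pi;                // Normalize to [0, 2*pi).
      grad_polar_coords(x, y) = Vector2f(std::hypot(gx, gy), ori);
    }
  }
}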

The computation of SIFT in C++ can be sketched as follows:

#include <Eigen/Dense>

#include <cmath>

constexpr auto N = 4;
constexpr auto O = 8;
constexpr auto pi = 3.14159265358979f;

using descriptor_type = Eigen::Matrix<float, N * N * O, 1>;
using Eigen::Vector2f;

// Defined in the next subsection.
void trilinear_interpolation(descriptor_type& h, const Vector2f& pos, float ori,
                             float weight, float mag);

auto compute_sift_descriptor(float x, float y, float sigma, float theta,
                             const Image<Vector2f, 2>& grad_polar_coords)
    -> descriptor_type
{
  constexpr auto lambda_zoom = 3.f;

  // The SIFT descriptor.
  descriptor_type h = descriptor_type::Zero();

  // The radius of each overlapping patches.
  const auto s = lambda_zoom * sigma;

  // The radius of the total patch.
  const auto r = std::sqrt(2.f) * s * (N + 1) / 2.f;

  // Linear part of the normalization transform.
  const Eigen::Matrix2f T =
      Eigen::Rotation2D<float>{-theta}.toRotationMatrix() / s;

  // Loop to perform interpolation
  const auto rounded_r = static_cast<int>(std::round(r));
  const auto rounded_x = static_cast<int>(std::round(x));
  const auto rounded_y = static_cast<int>(std::round(y));
  for (auto v = -rounded_r; v <= rounded_r; ++v)
  {
    for (auto u = -rounded_r; u <= rounded_r; ++u)
    {
      // Retrieve the normalized coordinates (non-const: shifted to SIFT
      // coordinates further below).
      Vector2f pos = T * Vector2f(u, v);

      // Boundary check.
      if (rounded_x + u < 0 || rounded_x + u >= grad_polar_coords.width() ||
          rounded_y + v < 0 || rounded_y + v >= grad_polar_coords.height())
        continue;

      // Gaussian weight contribution.
      const auto weight =
          std::exp(-pos.squaredNorm() / (2.f * std::pow(N / 2.f, 2)));

      // Read the precomputed gradient (in polar coordinates).
      const auto mag = grad_polar_coords(rounded_x + u, rounded_y + v)(0);
      auto ori = grad_polar_coords(rounded_x + u, rounded_y + v)(1) - theta;

      // Normalize the orientation.
      ori = ori < 0.f ? ori + 2.f * pi : ori;
      ori *= float(O) / (2.f * pi);

      // Shift the coordinates to retrieve the "SIFT" coordinates.
      pos.array() += N / 2.f - 0.5f;

      // Discard pixels that are not in the oriented patch.
      if (pos.minCoeff() <= -1.f || pos.maxCoeff() >= static_cast<float>(N))
        continue;

      // Accumulate the 4 gradient histograms using trilinear interpolation.
      trilinear_interpolation(h, pos, ori, weight, mag);
    }
  }

  return h;
}
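A hypothetical usage, with made-up keypoint values and grad_polar_coords holding the precomputed (magnitude, orientation) image at the keypoint's pyramid scale:

// Hypothetical keypoint k = (x, y, sigma, theta).
const auto h = compute_sift_descriptor(320.5f, 240.25f, 2.2f, 0.78f,
                                       grad_polar_coords);

The descriptor is then post-processed for robustness to illumination changes, as described in the last subsection.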

Trilinear Interpolation

We can sketch the trilinear interpolation in C++ as follows:

void trilinear_interpolation(descriptor_type& h, const Vector2f& pos, float ori,
                             float weight, float mag)
{
  // Use std::floor rather than a plain int cast: casting truncates toward
  // zero, which is wrong for the negative SIFT coordinates in [-1, 0).
  const auto xfloor = std::floor(pos.x());
  const auto yfloor = std::floor(pos.y());
  const auto xfrac = pos.x() - xfloor;
  const auto yfrac = pos.y() - yfloor;
  const auto orifrac = ori - std::floor(ori);
  const auto xi = static_cast<int>(xfloor);
  const auto yi = static_cast<int>(yfloor);
  const auto orii = static_cast<int>(std::floor(ori));

  auto at = [](int y, int x, int o) { return y * N * O + x * O + o; };

  for (auto dy = 0; dy < 2; ++dy)
  {
    const auto y = yi + dy;
    if (y < 0 || y >= N)
      continue;

    const auto wy = (dy == 0) ? 1.f - yfrac : yfrac;
    for (auto dx = 0; dx < 2; ++dx)
    {
      const auto x = xi + dx;
      if (x < 0 || x >= N)
        continue;

      const auto wx = (dx == 0) ? 1.f - xfrac : xfrac;
      for (auto dori = 0; dori < 2; ++dori)
      {
        const auto o = (orii + dori) % O;
        const auto wo = (dori == 0) ? 1.f - orifrac : orifrac;
        // Trilinear interpolation:
        h[at(y, x, o)] += wy * wx * wo * weight * mag;
      }
    }
  }
}

Robustness to Illumination Changes

[Lowe:2004:ijcv] explains that:

  • a brightness change consists in adding a constant offset to the image intensities, and image gradients cancel this offset, so SIFT is invariant to brightness changes by construction.
  • a contrast change amounts to multiplying the image intensities by a constant factor. Normalizing the descriptor cancels this multiplicative factor, so we must normalize the descriptor once the histograms of gradients are accumulated.
  • there are still other, nonlinear illumination changes. They arise for example from camera saturation and from surface reflectance properties. [Lowe:2004:ijcv] found experimentally that (1) clamping the histogram bins to \(0.2\) and then (2) renormalizing the descriptor worked well to account for these, on a dataset of 3D objects photographed under different lighting conditions.

Using Eigen, we can express these in C++:

auto enforce_invariance_to_illumination_changes(descriptor_type& h) -> void
{
  // SIFT is by construction invariant to brightness change since it is based
  // on gradients.

  // Make the descriptor robust to contrast change.
  h.normalize();

  // Apply the following recipe for nonlinear illumination change.
  //
  // 1) Clamp the histogram bin values to 0.2.
  constexpr auto max_bin_value = 0.2f;
  h = h.cwiseMin(descriptor_type::Ones() * max_bin_value);
  // 2) Renormalize again.
  h.normalize();
}
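Putting the sketches together, a hypothetical per-keypoint pipeline chains the descriptor computation with the illumination post-processing:

auto describe_keypoint(float x, float y, float sigma, float theta,
                       const Image<Vector2f, 2>& grad_polar_coords)
    -> descriptor_type
{
  auto h = compute_sift_descriptor(x, y, sigma, theta, grad_polar_coords);
  enforce_invariance_to_illumination_changes(h);
  return h;
}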