While we were developing a website that displayed user-uploaded images in fixed-size regions, we encountered some interesting problems, which led to some questions:

- Should we resize the image to fit the fixed-size region?
- Should we crop the image?
- Maybe we should mix the two?

After some tests we found that resizing alone is not the optimal solution, since the fixed-size region is quite small. I then tested the second and third options by cropping/resizing the images manually. The results were convincing!

Needless to say, we are not going to crop all the images manually! We should automate this task. The no-brainer solution is to crop around the center of the image. This naïve solution gave surprisingly acceptable results, but it sometimes fails badly when the interesting region lies outside the cropped zone. Our algorithm should be able to locate the interesting part of the image automatically.

**Information-theoretic solution**

The interestingness of an image is subjective and may vary from one person to another. One way to measure it quantitatively is to measure the information contained in the image: an interesting region is a zone that carries a lot of information. To compute the information of a particular region, we need to be able to calculate the information at each individual pixel.

One can calculate the information of a pixel using the information-theoretic definition of self-information:

$ I(i) = -\log p(i) $

where $i$ is a pixel, $I(i)$ is its self-information, and $p(i)$ is the probability of occurrence of the pixel $i$ in our image.

The probability of occurrence is simply the frequency of that particular pixel. An efficient way to calculate it is with a normalized histogram, which stores the frequency of an intensity measure of the pixel. In our case we convert the RGB image to the CIELAB color space, a color space defined by the CIE (Commission internationale de l'éclairage) that describes all the colors visible to the human eye.
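As a quick numeric illustration (the numbers are made up): in a $100 \times 100$ image, an intensity that occurs only 50 times is rare, while an intensity that covers half the image is common:

$ p = \frac{50}{10000} = 0.005 \Rightarrow I = -\log(0.005) \approx 5.3 $

$ p = 0.5 \Rightarrow I = -\log(0.5) \approx 0.69 $

The rarer the pixel value, the more information it carries.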

The problem is then reduced to maximizing the total information in a region R(h, w), i.e. finding a region of width w and height h with maximum information. To find that region we compute the information per line (the sum of the information of the pixels in that line) and the information per column.

We then only need to find the maximum-sum subsequence of fixed length over the lines and over the columns. Fortunately this is a well-known problem that can be solved in linear time, O(n).
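This fixed-length maximum-sum search can be done in one pass with a sliding window. A minimal sketch (under the assumption that the helper returns the start index of the best window along with its sum):

```go
package main

import "fmt"

// FindMaxSubInterval returns the start index of the length-k window of
// values with the maximum sum, together with that sum, in O(n) time.
func FindMaxSubInterval(values []float64, k int) (int, float64) {
	// Sum of the first window.
	sum := 0.0
	for i := 0; i < k; i++ {
		sum += values[i]
	}
	best, start := sum, 0
	// Slide the window one step at a time: add the element entering
	// on the right, subtract the element leaving on the left.
	for i := k; i < len(values); i++ {
		sum += values[i] - values[i-k]
		if sum > best {
			best, start = sum, i-k+1
		}
	}
	return start, best
}

func main() {
	start, sum := FindMaxSubInterval([]float64{1, 2, 5, 5, 1}, 2)
	fmt.Println(start, sum) // the best length-2 window is [5 5], at index 2
}
```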

I used the Go programming language to implement this algorithm:

```go
// The histogram of the image
var H [256]float64

// The self-information of lines/columns
Hx := make([]float64, width)
Hy := make([]float64, height)

// Compute the histogram of the image
for x := 0; x < width; x++ {
	for y := 0; y < height; y++ {
		// I is the intensity of the pixel
		// measured in the CIELAB color space
		I := CalulatePixelIntensity(m, x, y)
		// Increment the histogram
		H[I] += 1
	}
}

// Normalize the histogram
sum := 0.0
for y := 0; y < 256; y++ {
	sum += H[y]
}
for y := 0; y < 256; y++ {
	H[y] = H[y] / sum
}

// Compute the self-information for lines/columns
for x := 0; x < width; x++ {
	for y := 0; y < height; y++ {
		I := CalulatePixelIntensity(m, x, y)
		// H[I] = the probability of the pixel
		// -log(p) = the information in the pixel
		Hx[x] += -math.Log(H[I])
		Hy[y] += -math.Log(H[I])
	}
}

// The x-coordinate of the optimal cropping region
Tx, preservedInfoX := FindMaxSubInterval(Hx, rectX)

// The y-coordinate of the optimal cropping region
Ty, preservedInfoY := FindMaxSubInterval(Hy, rectY)
```

This solution worked quite well most of the time, but on some images it fails to capture the interesting part. The problem appears especially when the image contains color gradients: a gradient background is not important, yet it has many unique colors, so its information measure is quite high.

The following images were rendered using the described algorithm. They show a heat map of the self-information at each pixel: "hot" pixels contain a lot of information.

**Maximizing the gradient**

Another possible measure of the interestingness of a pixel is simply the gradient norm at that pixel. The gradient measures the horizontal and vertical change of the intensity of the pixel (here we use the CIELAB lightness as the intensity). If the neighboring pixels have the same intensity, the gradient norm is zero. We define the interestingness of a region as the sum of the gradient norms of its individual pixels. The formula of the gradient is:

$ \nabla I = \frac{\partial I}{\partial x} e_x + \frac{\partial I}{\partial y} e_y $

And we want to maximize, over the candidate positions (u, v) of the cropping region, the function:

$ \phi(u,v) = \sum_{i=u}^{u+h-1} \sum_{j=v}^{v+w-1} \|\nabla I(i,j)\|^2 $

We can maximize this function with the previous line/column algorithm in a similar fashion. The following images were rendered using the described algorithm: a heat map of the gradient magnitude at each pixel, where "hot" pixels are the most interesting ones.
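As an illustration, the per-line and per-column gradient energy can be accumulated with simple finite differences. This is a single-channel sketch over a lightness matrix, not the article's exact code:

```go
package main

import "fmt"

// GradientEnergy accumulates the squared gradient norm of a lightness
// matrix L (indexed [y][x]) into per-column (Gx) and per-row (Gy) totals,
// using forward differences.
func GradientEnergy(L [][]float64) (Gx, Gy []float64) {
	h, w := len(L), len(L[0])
	Gx = make([]float64, w)
	Gy = make([]float64, h)
	for y := 0; y < h-1; y++ {
		for x := 0; x < w-1; x++ {
			dx := L[y][x+1] - L[y][x] // horizontal change
			dy := L[y+1][x] - L[y][x] // vertical change
			g := dx*dx + dy*dy        // squared gradient norm
			Gx[x] += g
			Gy[y] += g
		}
	}
	return Gx, Gy
}

func main() {
	// A flat region contributes zero energy; a vertical edge between the
	// last two columns concentrates all the energy in one column total.
	L := [][]float64{
		{0, 0, 10},
		{0, 0, 10},
		{0, 0, 10},
	}
	Gx, Gy := GradientEnergy(L)
	fmt.Println(Gx, Gy)
}
```

Feeding `Gx` and `Gy` to the same fixed-length maximum-sum search then yields the crop coordinates, exactly as in the information-based version.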

**State-of-the-art visual saliency**

Visual salience (or visual saliency) is the distinct subjective perceptual quality which makes some items in the world stand out from their neighbors and immediately grab our attention. [1]

There are several methods for estimating the saliency of an image. For example, some researchers proposed an algorithm based on biological models [2], an interesting approach that combines neuroscience and computer vision. We have chosen to implement another interesting approach, based on determining the local contrast of a pixel at different scales [3].

At each pixel we compute the local contrast between the pixel and its neighborhood at different scales. The local contrast of a pixel with respect to a surrounding region is the Euclidean distance between the pixel and the mean of that region, with pixel values taken in the CIELAB space. This yields several saliency values, one per scale, which are summed (pixel-wise) to obtain the final saliency value.
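The idea can be sketched as follows. This is a simplified single-channel version (using absolute difference on one intensity channel); the method described above uses the full CIELAB vector and the Euclidean distance to the window mean:

```go
package main

import (
	"fmt"
	"math"
)

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Saliency sums, over several window radii ("scales"), the absolute
// difference between each pixel and the mean of its surrounding window.
func Saliency(img [][]float64, radii []int) [][]float64 {
	h, w := len(img), len(img[0])
	out := make([][]float64, h)
	for y := range out {
		out[y] = make([]float64, w)
	}
	for _, r := range radii {
		for y := 0; y < h; y++ {
			for x := 0; x < w; x++ {
				// Mean of the window around (x, y), clamped at the borders.
				sum, n := 0.0, 0
				for j := clamp(y-r, 0, h-1); j <= clamp(y+r, 0, h-1); j++ {
					for i := clamp(x-r, 0, w-1); i <= clamp(x+r, 0, w-1); i++ {
						sum += img[j][i]
						n++
					}
				}
				out[y][x] += math.Abs(img[y][x] - sum/float64(n))
			}
		}
	}
	return out
}

func main() {
	// A lone bright pixel on a flat background is the most salient.
	img := [][]float64{
		{0, 0, 0},
		{0, 10, 0},
		{0, 0, 0},
	}
	s := Saliency(img, []int{1})
	fmt.Println(s[1][1] > s[0][0]) // true
}
```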

[1] http://www.scholarpedia.org/article/Visual_salience

[2] L. Itti, C. Koch, & E. Niebur (1998). A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(11):1254-1259.

[3] R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk. Salient Region Detection and Segmentation.
