To Correctly Track the Hazards in Front of the Ego Vehicle Using Monocular Vision
Abstract
Feature detection, description and matching are important elements in the field of computer vision, and advances in computing hardware have contributed to the considerable interest in this area. In this literature review, an overview of several existing feature detectors and descriptors is given. Furthermore, two common approaches to feature matching (Brute-Force and FLANN) are explained. Finally, the performance of the two feature matching methods is analysed.
Introduction
In the area of computer vision, object tracking has gained much importance. This can be attributed to the easy availability of powerful computers, affordable good-quality video cameras, and the need to automate the process of video analysis. The main steps in the analysis of a video are: detection of moving objects, tracking of the objects from frame to frame, and recognising the behaviour of the objects by analysing their tracks.
Object tracking is appropriate for: recognition of an object using motion; automatic observation and detection of suspicious activity; human-computer interaction for action recognition; observing traffic in real time to collect traffic data and use it for managing the flow of traffic; and vehicle navigation, enabling route planning and obstacle avoidance. Tracking is defined as the process of estimating the trajectory of an object as it moves around a scene.
Feature selection for tracking: Effective tracking depends on selecting appropriate features. Generally, the uniqueness of features enables an object to be recognised. Object representation and feature selection are closely related: for example, histogram-based representations use colour as the feature, while contour-based representations use object edges as features. Many tracking algorithms use a combination of these features. Some common visual features are as follows:
Colour. The visible colour of an object is affected mainly by two aspects: first, the spectral power distribution of the light source, and second, the reflectance properties of the object surface.
Edges. Object boundaries usually generate strong changes in image intensities, and these changes are identified to detect edges. Compared with colour features, edges are less sensitive to changes in brightness. The Canny edge detector [1] is a popular edge detection method because of its simplicity and accuracy.
Optical flow. It is a dense field of displacement vectors. A brightness constancy constraint is used in computing the optical flow: the brightness of corresponding pixels is assumed to be constant in successive frames [2].
Texture. It is the measurement of the variation of intensity of a surface which quantifies properties such as smoothness and consistency.
Depending on the application area, features are usually selected manually. However, automatic feature selection has gained popularity in the pattern recognition community. Automatic feature selection methods are divided into two groups: filter and wrapper methods [3].
Object Tracking: The objective of an object tracker is to generate the trajectory of an object over time by locating it in each frame of the video. Object detection and the association of object occurrences across frames can be performed either separately or jointly. In the first case, possible object regions are obtained in every frame using an object detection algorithm, and the tracker then links the objects across frames. In the second case, the object region and the association are estimated jointly by iteratively updating the object location and region information obtained from previous frames. In either approach, the objects are represented using shape and/or appearance models.
Object tracking is split into three main groups: Point Tracking, Kernel Tracking and Silhouette Tracking.
Point Tracking. Objects detected in successive frames are represented by points, and the association of the points is based on the previous object state, which includes object position and motion.
Kernel Tracking. The term kernel refers to the shape and appearance of the object. For example, the kernel can have a template that is rectangular or elliptical in shape with a related histogram. Tracking of objects is done by computing the kernel motion in frames that are consecutive.
Silhouette Tracking. Here the tracking is done by estimating the object region in each frame. Information encoded inside the object region is used in this method of tracking. This information can be in the form of appearance density and shape models, which are usually in the form of edge maps. Silhouettes are tracked either by shape matching or by contour evolution.
Kernel Tracking: This type of tracking is performed by computing the motion of the object, represented by a primitive object region, from frame to frame. The object motion is generally in the form of parametric motion (translation, conformal, affine, etc.) or of the dense flow field computed in subsequent frames. The algorithms differ in the appearance representation used, the number of objects tracked, and the method used to estimate the object motion. Tracking methods are divided into two subcategories based on the appearance representation used: templates and density-based appearance models, and multiview appearance models.
Tracking Using Template and Density-Based Appearance Models: The relative simplicity and low computational cost are the main reasons for the wide use of templates and density-based appearance models. Trackers in this category are further divided according to whether the objects are tracked individually or jointly.
Tracking single objects. Template matching is the most common approach in this category. It is a brute-force method of searching an image, Iw, for a region similar to the object template Ot defined in the previous frame. A limitation of this method is its high computational cost due to the brute-force search. The cost can be reduced by limiting the search to the neighbourhood of the object's previous position. Instead of a template, colour histograms or mixture models can be computed using the appearance of the pixels inside rectangular or ellipsoidal regions.
Tracking Multiple Objects. Objects that are modelled individually do not take into account the interactions between objects, or between objects and the background, during tracking. An example of interaction between objects is when one object partly or completely occludes another.
Tracking Using Multiview Appearance Models. In the preceding tracking methods, the appearance models are generally created online; hence these models only use the most recent observations of the object. Because objects may appear different from different views, a dramatic change in the object view during tracking can render the appearance model invalid, and the track of the object may be lost. This problem can be overcome by learning the different views of the object offline and then using them for tracking.
Silhouette Tracking: Objects with complex shapes, such as hands, head and shoulders, are difficult to describe well with simple geometric primitives. Silhouette-based methods are able to provide an accurate shape description for such objects. The goal of a silhouette-based tracker is to find the object region in each frame by means of an object model generated from previous frames. This object model can take any of the following forms: a colour histogram, object edges or the object shape. Silhouette trackers can be divided into two categories: shape matching and contour tracking.
In the shape matching approach, a search for the object silhouette is performed in the current frame by calculating the similarity between the object and the hypothesised silhouette.
On the other hand, in the contour tracking approach, an initial contour is evolved to its new position in the current frame, either by using state space models or by directly minimising some energy functional.
Shape Matching. This is performed in a similar way to template matching: a search for the object silhouette and its associated model is made in the current frame by calculating the similarity between the object and the generated model. In this approach only the translation of the silhouette from the current frame to the next is handled; hence, nonrigid object motion is not explicitly taken care of.
Another approach in shape matching is to find matching silhouettes in two successive frames. Silhouette detection is mostly done by background subtraction.
Contour Tracking. Unlike shape matching methods, contour tracking methods iteratively evolve an initial contour from its position in the previous frame to its new position in the current frame. This requires that some part of the object in the current frame overlaps with the object region in the previous frame. Tracking by contour evolution is carried out using one of two approaches: the first uses state space models, and the second directly evolves the contour by decreasing the contour energy using a minimisation technique such as gradient descent.
To track the complete region of an object, silhouette tracking is used. The ability to deal with a large variety of object shapes is an important benefit of silhouette tracking. Silhouette-based object trackers use representations in the form of motion models, appearance models or shape models, or a combination of these. Object appearance is modelled by parametric or nonparametric density functions, such as histograms or mixtures of Gaussians. Object shape can be modelled in the form of a contour subspace, produced from object contours obtained from various object poses [4].
Another important feature of silhouette tracking methods is occlusion handling. The other important issue for silhouette trackers is their ability to deal with object splitting and merging.
Point Tracking: Objects detected in consecutive frames and represented by points can be tracked by establishing the correspondence of these points across frames. Occlusions, missed detections, and object entries and exits make point correspondence a difficult problem. Point correspondence methods are divided into two categories: deterministic and statistical methods. Deterministic methods use qualitative motion heuristics [5] to constrain the correspondence problem, whereas probabilistic methods take the object measurements and their uncertainties into account in order to establish correspondence.
Deterministic Methods for Correspondence: These methods define a cost for associating each object in the previous frame (t-1) with a single object in the current frame (t) using a set of motion constraints; the correspondence cost is minimised by formulating it as a combinatorial optimisation problem. The following constraints are used to define the correspondence cost: proximity assumes that the location of the object does not change notably from one frame to the next; maximum velocity defines an upper limit on the object velocity and restricts the possible correspondences to a region around the object; small velocity change assumes that neither the direction nor the speed of the object changes drastically; common motion constrains the velocities of objects in a small neighbourhood to be similar; rigidity assumes that objects in the 3D world are rigid, so that the distance between any two points on an object does not change; proximal uniformity is a combination of the proximity and small velocity change constraints.
Statistical Methods for Correspondence: Measurements obtained from video sensors contain noise. In addition, the motion of an object can undergo random perturbations, as with turning vehicles. Statistical correspondence methods address these tracking problems by taking the measurement and model uncertainties into account when estimating the object state. They use the state space approach to model object properties such as position, velocity and acceleration.
Point tracking methods can be evaluated on the basis of whether they generate the point trajectories correctly. Performance can be measured using precision and recall. Object trackers can also be compared qualitatively on their ability to deal with new object entries and exits of existing objects, to handle missing observations, and to provide an optimal solution to the cost function used for establishing correspondence.
Point trackers are good for tracking objects that are small and can be represented by a single point. Multiple points are required to track larger objects, and one of the main problems in multiple-point tracking is automatically clustering the points that belong to the same object. This is needed in order to distinguish between different objects and between the objects and their background.
Feature Point Detectors
Feature detection refers to methods that compute abstractions of image information and make a local decision at every image point as to whether an image feature of a given type is present at that point or not. The resulting features are subsets of the image domain, usually in the form of isolated points, continuous curves or connected regions.
A feature is defined as an interesting part of an image, and repeatability is considered to be an important property of a feature detector. Feature detection is a low-level image processing operation. The different types of image features are edges, corners/interest points, blobs/regions of interest, and ridges.
Feature detectors are organised broadly into three groups: single-scale detectors, multi-scale detectors and affine invariant detectors [6]. Single-scale detectors provide only one representation of the features or object contours, determined by the internal parameters of the detector. Single-scale detectors are invariant to image transformations such as rotation, translation, illumination changes and the addition of noise. However, because they are incapable of dealing with scaling, it is essential to build multi-scale detectors.
Some of the single-scale detectors are Moravec’s detector, Harris detector, SUSAN detector, FAST detector, and Hessian detector. Multi-scale detectors are Laplacian of Gaussian (LoG), Difference of Gaussian (DoG), Harris-Laplace, Hessian-Laplace, and Gabor-Wavelet detector.
Moravec's Detector [7]: This detector is used in particular for locating distinct regions in an image that can be used to register successive image frames. The algorithm detects corners, where a corner is a point with low self-similarity. The detector tests each pixel in the image for the presence of a corner by taking a local image patch centred on the pixel and determining the similarity between this patch and the nearby, overlapping patches. The similarity is measured by the sum of squared differences (SSD) between the centred patch and the other image patches.
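As an illustration of this similarity measure, the sketch below (my own illustrative code, not taken from [7]) computes a Moravec-style corner score as the minimum SSD between the patch centred on a pixel and copies shifted in four principal directions; the patch half-size and the set of shifts are assumed values, and border handling is ignored.

#include <opencv2/core.hpp>
#include <algorithm>
#include <limits>

// Illustrative Moravec-style corner score: the minimum SSD between the patch
// centred on (x, y) and its copies shifted in four principal directions.
// Assumes an 8-bit grayscale image; a corner gives a high SSD in every direction.
double moravecScore(const cv::Mat& gray, int x, int y, int half = 2)
{
    const int shifts[4][2] = { {1, 0}, {0, 1}, {1, 1}, {1, -1} };
    double minSSD = std::numeric_limits<double>::max();
    for (const auto& s : shifts)
    {
        double ssd = 0.0;
        for (int dy = -half; dy <= half; ++dy)
            for (int dx = -half; dx <= half; ++dx)
            {
                int a = gray.at<uchar>(y + dy, x + dx);
                int b = gray.at<uchar>(y + dy + s[1], x + dx + s[0]);
                ssd += static_cast<double>(a - b) * (a - b);
            }
        minSSD = std::min(minSSD, ssd);   // keep the weakest (most self-similar) direction
    }
    return minSSD;
}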
Harris Detector: This detector was developed by Harris and Stephens [8] as a combined corner and edge detector, in order to address the limitations of Moravec's detector. The variation of intensity is considered over all orientations rather than a set of discrete shifts, which makes it a better detector in terms of detection and repeatability rate.
SUSAN detector: Smith and Brady [9] introduced a technique based on generic low-level image processing, in contrast to computing corners using image derivatives. This technique was called SUSAN (Smallest Univalue Segment Assimilating Nucleus). In addition to detecting corners, SUSAN is also used to detect edges and to reduce image noise. A corner is detected by placing a circular mask of fixed radius at every pixel in the image; the pixel at the centre is known as the nucleus. The pixels in the area under the mask are compared with the nucleus to determine whether their intensity values are similar or dissimilar. The pixels that have brightness similar to that of the nucleus are grouped together, and the resulting area is known as the USAN (Univalue Segment Assimilating Nucleus).
The advantages of this detector are: (i) no image derivatives are used, so no noise reduction or expensive derivative computation is required; (ii) high repeatability in feature detection; and (iii) invariance to translation and rotation.
FAST (Features from Accelerated Segment Test): FAST is an efficient feature detector developed by Rosten and Drummond [10]. It is a very fast corner detector based on a segment test, which compares the pixels on a circle centred on the candidate feature point. A circle of 16 pixels is used to classify whether a point p is a corner: p is classified as a corner if there are n contiguous pixels in the circle, as shown in Figure 1 (courtesy of [10]), that are all brighter than the intensity of the candidate pixel plus a threshold t, or all darker than it minus t.
Figure 1 – Feature Detection using FAST
Despite its good performance, FAST has some shortcomings, which are addressed with a machine learning approach. Because of its high speed, this detector is well suited to real-time video processing applications. However, the detector is not invariant to changes in scale and is vulnerable to noise.
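For reference, FAST is available in OpenCV's features2d module; a minimal usage sketch is given below. The file name and the threshold value are placeholders rather than values taken from this project.

#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Minimal FAST corner detection with OpenCV.
void fastExample()
{
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);   // placeholder file name
    std::vector<cv::KeyPoint> corners;
    cv::Ptr<cv::FastFeatureDetector> fast =
        cv::FastFeatureDetector::create(40 /* threshold t */, true /* non-maximal suppression */);
    fast->detect(img, corners);   // each KeyPoint is a detected corner p
}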
Hessian Detector: The Hessian blob detector [11] is based on a 2 x 2 matrix of second-order derivatives of the image intensity K(p, q), called the Hessian matrix, which can be used to analyse the local image structure. It is stated in the form

$$H_D(p, q, a) = \begin{bmatrix} K_{pp}(p, q, a) & K_{pq}(p, q, a) \\ K_{pq}(p, q, a) & K_{qq}(p, q, a) \end{bmatrix}$$

where K_pp, K_pq and K_qq are second-order image derivatives computed using a Gaussian function of standard deviation a. The detector searches for a subset of points with high derivative responses in two orthogonal directions, i.e. points where the determinant of the Hessian matrix has a local maximum:

$$\det(H_D) = K_{pp}K_{qq} - K_{pq}^2$$
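A simple sketch of this determinant-of-Hessian response is given below, assuming a grayscale input and using OpenCV's Sobel operator to approximate the second-order derivatives after Gaussian smoothing; this is an approximation of the detector rather than a reference implementation.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Illustrative determinant-of-Hessian response map.
// K_pp, K_qq and K_pq are approximated with Sobel filters after Gaussian smoothing
// with standard deviation a.
cv::Mat hessianResponse(const cv::Mat& gray, double a)
{
    cv::Mat smooth, Kpp, Kqq, Kpq;
    cv::GaussianBlur(gray, smooth, cv::Size(0, 0), a);   // kernel size derived from a
    cv::Sobel(smooth, Kpp, CV_32F, 2, 0);                // second derivative in p
    cv::Sobel(smooth, Kqq, CV_32F, 0, 2);                // second derivative in q
    cv::Sobel(smooth, Kpq, CV_32F, 1, 1);                // mixed derivative
    return Kpp.mul(Kqq) - Kpq.mul(Kpq);                  // det(H_D) = Kpp*Kqq - Kpq^2
}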
Laplacian of Gaussian (LoG): This is a blob detector obtained by linearly combining the second derivatives. For an input image J(r, s), the scale-space representation LG(r, s, b) is obtained by convolving the image with a variable-scale Gaussian kernel GK(r, s, b). Here

$$LG(r, s, b) = GK(r, s, b) * J(r, s)$$

and

$$GK(r, s, b) = \frac{1}{2\pi b^2} e^{-\frac{r^2 + s^2}{2b^2}}$$

The Laplacian operator is then computed as:

$$\nabla^2 LG(r, s, b) = LG_{rr}(r, s, b) + LG_{ss}(r, s, b)$$
Difference of Gaussian (DoG): To overcome the computational cost of the LoG operator, Lowe [12] proposed an efficient algorithm based on local 3D extrema in a scale-space pyramid built with Difference of Gaussian (DoG) filters. This approach is used in the SIFT algorithm. The DoG gives a close approximation to the LoG and is used to detect stable features efficiently from the scale-space extrema. The difference of Gaussian function is calculated as:

$$DG(c, d, \delta) = \big(GK(c, d, k\delta) - GK(c, d, \delta)\big) * J(r, s) = LG(c, d, k\delta) - LG(c, d, \delta)$$
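As a sketch of how a single DoG level could be computed with OpenCV (assuming a grayscale input image and an illustrative scale ratio k, for example the value of roughly sqrt(2) used in SIFT):

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Illustrative single-level Difference-of-Gaussian response (not SIFT's pyramid code).
cv::Mat dogResponse(const cv::Mat& gray, double delta, double k = 1.41421356)
{
    cv::Mat g1, g2, dog;
    cv::GaussianBlur(gray, g1, cv::Size(0, 0), delta);        // GK(c, d, delta) * J
    cv::GaussianBlur(gray, g2, cv::Size(0, 0), k * delta);    // GK(c, d, k*delta) * J
    cv::subtract(g2, g1, dog, cv::noArray(), CV_32F);         // DG = LG(k*delta) - LG(delta)
    return dog;
}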
Harris-Laplace: Mikolajczyk and Schmid [13] proposed Harris-Laplace, a scale-invariant corner detector. It relies on a combination of the Harris corner detector and a Gaussian scale-space representation. Although the Harris corner points are invariant to rotation, they are not scale invariant; hence the second-moment matrix used by the detector is modified to make it independent of the image resolution. It is represented as

$$SM(m, n, \rho_I, \rho_D) = \rho_D^2 \, h(\rho_I) * \begin{bmatrix} I_m^2(m, n, \rho_D) & I_m I_n(m, n, \rho_D) \\ I_m I_n(m, n, \rho_D) & I_n^2(m, n, \rho_D) \end{bmatrix}$$

where I_m and I_n are the image derivatives computed in their respective directions using a Gaussian kernel of scale ρ_D. The parameter ρ_I determines the current scale at which the Harris corner points are detected in the Gaussian scale space.
This method provides a set of points that are characteristic both in the image and over the scale dimension. The number of redundant interest points is also noticeably reduced compared with the multi-scale Harris detector.
Hessian-Laplace: An idea similar to Harris-Laplace is applied to the Hessian-based detector, leading to a scale-invariant detector named Hessian-Laplace. Here a scale-space representation is built using Laplacian or DoG filters, and blob-like features are then extracted using the Hessian determinant. This detector extracts a large number of features that cover the entire image, but at a comparatively lower repeatability than Harris-Laplace.
Gabor-Wavelet detector: A multi-scale interest point detector based on the principle of Gabor wavelets was proposed by Yussof and Hitam [14]. Gabor wavelets are biologically motivated kernels in the form of plane waves restricted by a Gaussian envelope function.
Affine Invariant Detectors: The feature detectors mentioned above are invariant to translations, rotations and uniform scaling, under the assumption that localisation and scale are not affected by an affine transformation of the local image structure. Thus, the problem of affine invariance is only partially handled, because the scale can be different in each direction. Building a detector that is robust to perspective transformations therefore requires making it invariant to affine transformations. An affine invariant detector can be seen as a generalised version of a scale invariant detector.
Feature Point Descriptors
A feature descriptor is defined as an algorithm that takes an image and outputs feature descriptors. Like a fingerprint, a feature descriptor is able to differentiate one feature from another, because it encodes information into a series of numbers that is distinctive for each feature. Ideally, this information would remain unchanged under image transformations, so that the feature can be found again even if the image is transformed in some way.
The descriptors can either be floating-point representations like SIFT and SURF, or they can be binary descriptors like BRIEF, BRISK, ORB, etc.
Floating-point representations describe the area around an image pixel with a vector of floating-point numbers (128 of them in the case of SIFT). This vector is usually a histogram of measurements obtained from the image, generally functions of the local intensity gradients.
Binary descriptors came about in order to solve the problems of extraction time and dimensionality in floating-point representations, which affect both their storage space and their matching time. Binary descriptors provide low-dimensional and efficient representations in which the descriptor values are assigned quickly by comparing pixel intensities.
The different descriptors that are used to find features in images are explained below. First, I will look at Floating point representations like SIFT and SURF. Then I shall look at Binary Descriptors like BRIEF, ORB and BRISK.
SIFT (Scale-Invariant Feature Transform): Scale Invariant Feature Transform (SIFT) is a computer vision algorithm used both to detect and to describe local features in images. The algorithm was published by Lowe in 2004 [12]. It is robust to image rotation, affine transformations, and changes in intensity and viewpoint when matching features.
Points of interest are extracted from an object in an image to give a feature description of the object. The description extracted from a training image is used to identify the object when trying to locate it in a test image. In order to have reliable identification, the features extracted from the training image must remain detectable even when there are changes in image factors such as scale, noise and illumination.
SIFT is able to robustly identify objects even among clutter and under partial occlusion. There are four basic steps in the SIFT algorithm. Firstly, scale-space extrema are estimated using the built-in Difference of Gaussian (DoG) detector. Next, the keypoints are localised and refined by removing points of low contrast. Thirdly, each keypoint is assigned an orientation; this is done to achieve invariance to image rotation. And finally, the keypoints in the two images are compared by finding their closest neighbours.
One of the major disadvantages of using SIFT in real-time applications is its large computational complexity.
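For completeness, a minimal sketch of detecting and describing SIFT features with OpenCV is shown below; it assumes OpenCV 4.4 or later, where cv::SIFT is part of the main features2d module (in older versions it lives in xfeatures2d), and the file name is a placeholder.

#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Minimal SIFT keypoint detection and description with OpenCV.
void siftExample()
{
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);   // placeholder file name
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                                           // one 128-dimensional float row per keypoint
    sift->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
}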
SURF (Speeded-Up Robust Features): Speeded-Up Robust Features (SURF) is a combined detector and descriptor developed by Bay et al. [15]. Like SIFT, SURF is patented and acts as both a local feature detector and a descriptor.
The main objective of this descriptor is to give a distinctive and robust description of an image feature by describing the distribution of pixel intensities in the neighbourhood of the point of interest. Most descriptors are computed locally, so a description is obtained for every identified point of interest.
SURF approximates the Laplacian of Gaussian (LoG) with box filters: squares are used instead of Gaussian averaging of the image, because convolution with a box filter can be calculated very quickly with the help of integral images. SURF uses a blob detector based on the Hessian matrix to find the points of interest.
For orientation assignment, SURF uses wavelet responses in the horizontal and vertical directions, weighted with an appropriate Gaussian. The wavelet responses are also used for the feature description: a neighbourhood around the keypoint is selected and divided into smaller sub-regions, and for each sub-region the wavelet responses are collected and represented to form the SURF feature descriptor.
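A corresponding sketch for SURF is given below; it assumes a build of OpenCV that includes the opencv_contrib xfeatures2d module (SURF is patented and is not shipped in all builds), and the Hessian threshold value is illustrative.

#include <opencv2/xfeatures2d.hpp>   // SURF lives in the opencv_contrib modules
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Minimal SURF detection and description, assuming opencv_contrib is available.
void surfExample()
{
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);                  // placeholder file name
    cv::Ptr<cv::xfeatures2d::SURF> surf = cv::xfeatures2d::SURF::create(400.0);   // illustrative Hessian threshold
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;                                                          // 64-dimensional float rows by default
    surf->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
}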
ORB Descriptor: Oriented FAST and Rotated BRIEF (ORB) was proposed by Rublee et al. as an efficient alternative to SIFT and SURF [16]. ORB combines the FAST keypoint detector and the BRIEF descriptor, with several modifications to improve their performance, and was intended as a fast and efficient alternative to SIFT. Keypoints are first found with FAST, and the Harris corner measure is then applied to select the top N points among them.
Figure 2 below shows an example of ORB being used to match two images with a change in viewpoint. Valid matches are shown as green lines and unmatched points are indicated by red circles.
Figure 2 – ORB descriptor – An example of key points matching using ORB
A binary descriptor is built from three parts. First, a sampling pattern: the locations at which points are sampled in the area around the keypoint. Next, orientation compensation: some method to calculate the orientation of the keypoint so that the descriptor can be rotated to compensate for rotation changes. And finally, sampling pairs: the pairs of points that are compared when building the final descriptor.
The ORB descriptor is a bit like BRIEF; it does not have a complex sampling pattern like BRISK [17] or FREAK [18]. However, there are two key differences between ORB and BRIEF: first, ORB uses an orientation compensation mechanism to make it rotation invariant, and second, ORB uses optimal sampling pairs.
Orientation Compensation: ORB uses a straightforward measure of corner orientation – the intensity centroid [19]. Firstly, the moments of a patch are defined as:
$$mom_{rs} = \sum_{i, j} i^r j^s I(i, j)$$
With these moments, the centroid, the "centre of mass" of the patch, is found as:
$$C = \left(\frac{mom_{10}}{mom_{00}}, \frac{mom_{01}}{mom_{00}}\right)$$
Next, a vector is formed from the corner’s centre O to the centroid – OC. The orientation of the patch is then calculated:
$$\theta = \operatorname{atan2}(mom_{01}, mom_{10})$$
After calculating the orientation of the patch, I rotate the patch to a canonical rotation before computing the descriptor, thereby obtaining some invariance to rotation.
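A small sketch of this orientation computation is given below, using cv::moments on a grayscale patch centred on the keypoint. It is an illustrative helper rather than ORB's internal implementation; the moments here are taken relative to the patch centre so that the angle corresponds to the vector OC described above.

#include <opencv2/imgproc.hpp>
#include <cmath>

// Orientation of a patch as the angle of the vector from the patch centre O to the
// intensity centroid C. cv::moments supplies mom00, mom10 and mom01 directly.
double patchOrientation(const cv::Mat& patch)
{
    cv::Moments m = cv::moments(patch);
    double cx = m.m10 / m.m00 - (patch.cols - 1) / 2.0;   // centroid x relative to the centre
    double cy = m.m01 / m.m00 - (patch.rows - 1) / 2.0;   // centroid y relative to the centre
    return std::atan2(cy, cx);                            // patch orientation in radians
}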
BRISK (Binary Robust Invariant Scalable Keypoints): The BRISK descriptor is different from the BRIEF and ORB descriptors. It uses a hand-crafted sampling pattern similar to the one in Figure 3 (courtesy of [17]), made of concentric rings. For each sample point, a small patch around the point is taken and Gaussian smoothing is applied.
Figure 3 – BRISK descriptor – BRISK sampling pattern
When looking at the sampling pattern, I can distinguish short and long pairs. Short pairs are pairs of sampling points whose distance is below a certain threshold (d_max), while long pairs are pairs whose distance is above a certain threshold (d_min). Since d_min is greater than d_max, no short pair is also a long pair.
Orientation is determined in BRISK using long pairs and comparing of intensity is done by using short pairs as shown in Figure 4 (courtesy of [17]). Similar to BRIEF and ORB, orientation and intensity comparisons are used to build the descriptor.
Figure 4 – BRISK descriptor – BRISK short pairs
KAZE: KAZE was proposed by Alcantarilla et al. [20] and was named in tribute to Iijima [21], who is considered the father of scale space analysis. Its main strength is that nonlinear diffusion filtering, rather than Gaussian blurring, is used to build the scale space. It is a multiscale 2D feature detection and description algorithm that works in a nonlinear scale space. SIFT and SURF find features in the Gaussian scale space, but Gaussian blurring smooths noise and image details to the same degree, so boundary information is lost as the original image evolves through the scale space.
Using nonlinear diffusion, features can be detected and described in a nonlinear scale space, keeping the details of the image intact while reducing or removing noise as the image evolves through the scale space. Variable conductance diffusion, one of the simplest cases of nonlinear diffusion, is used. Additive Operator Splitting (AOS) schemes are used to build the nonlinear scale space because they are stable for any step size and are parallelisable.
AKAZE (Accelerated-KAZE): Accelerated-KAZE was created because of KAZE's computational cost. It uses a mathematical framework called Fast Explicit Diffusion (FED), embedded in a pyramidal framework, to speed up the computation of the nonlinear scale space. In addition, a Modified-Local Difference Binary (M-LDB) descriptor is computed, which uses the gradient information from the nonlinear scale space.
The AKAZE algorithm works similarly to KAZE but, unlike KAZE, it uses a modified version of the Local Difference Binary (LDB) descriptor.
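AKAZE is available in OpenCV's main features2d module; a minimal usage sketch is given below, with a placeholder file name.

#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

// Minimal AKAZE detection and description. By default it produces the binary
// M-LDB descriptor mentioned above.
void akazeExample()
{
    cv::Mat img = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);   // placeholder file name
    cv::Ptr<cv::AKAZE> akaze = cv::AKAZE::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    akaze->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
}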
BRIEF (Binary Robust Independent Elementary Features): It is a low-bitrate descriptor that performs only binary comparison tests and uses the Hamming distance in place of the Euclidean distance. SIFT uses a 128-dimensional vector for its descriptors; as it uses floating-point numbers, it takes 512 bytes. Similarly, SURF takes at least 256 bytes. Such descriptors take up a lot of memory when the features number in the thousands, and the time taken for matching also goes up.
In this context, BRIEF provides a way to find binary strings directly, without first computing a floating-point descriptor. It takes a smoothened image patch and selects a set of location pairs. Pixel intensity comparisons are then performed on these location pairs, and a bit string is obtained.
Project proposal
Dataset used: The dataset consists of a sequence of 397 grayscale still images (photographs). It contains images with varying geometric and photometric transformations (change in viewpoint, change in scale, image blurring, and change in illumination). The images contain different objects of interest which are either illuminated or occluded. They show a vehicle moving through areas of light and shade, apparently in a straight line, as well as several pedestrians moving across the frame. Sample images from the dataset are shown in Figure 5.
Goal: The goal of this research is to identify the best feature matcher for two images.
Feature Matching: It is the process of finding corresponding features between two similar images from the dataset. One approach to matching images involves detecting points of interest in the image data and associating them with image descriptors. After the features and descriptors are found in the two images, some initial feature matches are established between the images, as illustrated in Figure 6.
Frame 392
Frame 394
Figure 6 – Matching image regions based on their local feature descriptors
The image matching problem can be stated formally as follows. Let d be a point detected in an image by a detector and linked to a set of descriptors

$$\gamma(d) = \{\gamma_i(d) \mid i = 1, 2, \ldots, I\}$$

where, for each i, the feature vector specified by the i-th descriptor is

$$\gamma_i(d) = \big(e_1(d_i), e_2(d_i), \ldots, e_{n_i}(d_i)\big)$$

The objective is to find the best match m in the other image from the set of N points of interest $M = \{m_1, m_2, \ldots, m_N\}$. This is done by comparing the feature vector $\gamma_i(d)$ with the points in the set M. Hence, the distance measure between the descriptors $\gamma_i(d)$ and $\gamma_i(m)$ of two points of interest can be defined as

$$t_i(d, m) = \lvert \gamma_i(d) - \gamma_i(m) \rvert$$

The points of M are then sorted in ascending order of the distance $t_i$, separately for each descriptor, to create the sets

$$\vartheta(d, M) = \{\vartheta_i(d, M) \mid i = 1, 2, \ldots, I\}$$

such that

$$\vartheta_i(d, M) = \{\vartheta_{i1}, \vartheta_{i2}, \ldots, \vartheta_{iN} \in M \mid t_i(d, \vartheta_{ij}) \leq t_i(d, \vartheta_{ik}),\ \forall\, j < k\}$$
A match between a pair of interest points (d, m) is accepted only if d is the best possible match for m compared with the rest of the points in the first image and, similarly, m is the best possible match for d compared with the rest of the points in the second image.
Generally, the matching performance of interest-point-based methods depends on both the properties of the interest points and the image descriptors chosen [22].
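The mutual best-match criterion described above can be approximated in OpenCV by enabling cross-checking in the brute-force matcher, as sketched below. This is illustrative rather than part of the project code; NORM_HAMMING is assumed here because the descriptors used later in this project are binary ORB descriptors.

#include <opencv2/features2d.hpp>
#include <vector>

// Symmetric ("mutual best") matching via cross-checking: a match (d, m) is kept
// only if d is m's nearest neighbour and m is d's nearest neighbour.
std::vector<cv::DMatch> mutualMatches(const cv::Mat& descriptors_object,
                                      const cv::Mat& descriptors_scene)
{
    cv::BFMatcher matcher(cv::NORM_HAMMING, /* crossCheck = */ true);
    std::vector<cv::DMatch> matches;
    matcher.match(descriptors_object, descriptors_scene, matches);
    return matches;
}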
Next, I looked at the working of two common feature matchers: Brute-Force matcher and FLANN matcher.
Brute-Force Matcher: In this matcher, each feature descriptor from the first set is compared with every feature in the second set by calculating a distance, and the closest one is recorded.
The first step is to read the image (object) in frame k that is to be searched for in another image (scene) in frame k+1. The scene image need not have the same lighting, angle, etc to be able to match with the object image. The object image that I am trying to match is given in Figure 7.
Figure 7 – Object image (image to be searched)
The scene image in which the object image (template) is to be searched is shown in Figure 8.
Figure 8 – Scene image (image to be searched in)
As visible, the template image (object image) is smaller in size as compared to the image that is being used to search (scene image). There is some noticeable change in the position of the vehicle in both the frames.
Using brute-force matching, features in both images are found and matched. All the matches can be drawn, but if the number of drawn matches is large, there will be many false positives.
//– Step 1: Detect the keypoints using ORB Detector
std::vector<KeyPoint> keypoints_object, keypoints_scene;
Ptr<FeatureDetector> detector = ORB::create();
detector->detect(object, keypoints_object);
//– Step 2: Calculate descriptors (feature vectors)
Ptr<DescriptorExtractor> extractor = ORB::create();
Mat descriptors_object, descriptors_scene;
extractor->compute(object, keypoints_object, descriptors_object);
Here, the keypoints and their descriptors are found for the object using the ORB detector.
detector->detect(scene, keypoints_scene);
extractor->compute(scene, keypoints_scene, descriptors_scene);
Similarly, the keypoints and their descriptors are found for the scene.
//– Step 3: Matching descriptor vectors using Brute Force matcher
BFMatcher matcher(NORM_HAMMING); // Hamming distance is appropriate for the binary ORB descriptors
std::vector< DMatch > matches;
matcher.match(descriptors_object, descriptors_scene, matches);
This is the brute-force matcher.
//– Quick calculation of max and min distances between keypoints
double max_dist = 0; double min_dist = 100;
for (int i = 0; i < descriptors_object.rows; i++)
{
double dist = matches[i].distance;
if (dist < min_dist) min_dist = dist;
if (dist > max_dist) max_dist = dist;
}
//– Draw only “good” matches (i.e. whose distance is less than 3*min_dist )
std::vector< DMatch > good_matches;
for (int i = 0; i < descriptors_object.rows; i++)
{
if (matches[i].distance < 3 * min_dist)
{
good_matches.push_back(matches[i]);
}
}
Mat img_matches;
drawMatches(object, keypoints_object, scene, keypoints_scene, good_matches, img_matches, Scalar::all(-1), Scalar::all(-1), std::vector<char>(), DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS);
Here all the matches are drawn based on a condition as shown in Figure 9.
Figure 9 – Matches drawn after comparing the descriptors (Brute-Force Matcher)
FLANN based matcher: FLANN stands for Fast Library for Approximate Nearest Neighbours. It is a library for carrying out fast searches on the nearest neighbour in spaces of high dimensionality.
This matcher performs better than BFMatcher for large datasets.
Similar to the previous matcher, the two images are read. Then the keypoints and descriptors are found.
// Flann needs the descriptors to be of type CV_32F
descriptors_scene.convertTo(descriptors_scene, CV_32F);
descriptors_object.convertTo(descriptors_object, CV_32F);
FlannBasedMatcher matcher;
std::vector< DMatch > matches;
matcher.match(descriptors_object, descriptors_scene, matches);
This is the FLANN matcher. Before using the descriptors, they need to be converted to type CV_32F; CV_32F specifies that each element of the matrix is a 32-bit floating-point value, which is what the default FLANN index expects.
The matches that are drawn after matching using the FLANN matcher are shown in Figure 10.
Figure 10 – Matches drawn after comparing the descriptors (FLANN Matcher)
Figure 11 – Matches using Brute-Force matcher
Figure 12 – Matches using FLANN matcher
Looking at the two graphs generated by plotting the matches for each frame for both the Brute-Force matcher and the FLANN matcher, there is only a slight difference in the number of matches generated by each.
BFMatcher tries all possible matches in order to find the best ones. The FLANN matcher, on the other hand, is faster because it finds approximate nearest neighbours: good matches are found, but they may not be the best ones. The real benefit of FLANN is seen with large datasets.
FLANN builds an efficient data structure that is used to search for an approximate nearest neighbour, whereas BFMatcher is guaranteed to find the best neighbour by performing an exhaustive search.
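As a possible refinement of the fixed 3*min_dist threshold used earlier, the sketch below (not part of the code presented above) asks the FLANN matcher for the two nearest neighbours of each descriptor and keeps a match only if it is clearly better than the second best (Lowe's ratio test [12]). It assumes the descriptors have already been converted to CV_32F; the ratio value 0.75 is an illustrative choice.

#include <opencv2/features2d.hpp>
#include <vector>

// FLANN k-nearest-neighbour matching with Lowe's ratio test.
std::vector<cv::DMatch> ratioTestMatches(const cv::Mat& descriptors_object,
                                         const cv::Mat& descriptors_scene,
                                         float ratio = 0.75f)
{
    cv::FlannBasedMatcher matcher;
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descriptors_object, descriptors_scene, knn, 2);   // two best candidates per descriptor

    std::vector<cv::DMatch> good;
    for (const std::vector<cv::DMatch>& pair : knn)
        if (pair.size() == 2 && pair[0].distance < ratio * pair[1].distance)
            good.push_back(pair[0]);                                   // keep only unambiguous matches
    return good;
}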
References
[1] J. Canny, "A computational approach to edge detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 679–698, 1986.
[2] B. K. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185–203, 1981.
[3] A. L. Blum and P. Langley, "Selection of relevant features and examples in machine learning," Artificial Intelligence, vol. 97, pp. 245–271, 1997.
[4] A. Blake and M. Isard, Active Contours: The Application of Techniques from Graphics, Vision, Control Theory and Statistics to Visual Tracking of Shapes in Motion, Springer Science & Business Media, 2012.
[5] C. J. Veenman, M. J. Reinders and E. Backer, "Resolving motion correspondence for densely moving points," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 54–72, 2001.
[6] M. Hassaballah, A. A. Abdelmgeid and H. A. Alshazly, Image Features Detection, Description and Matching, Springer, 2016.
[7] H. Moravec, "Towards automatic visual obstacle avoidance," in Proceedings of the 5th International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1977, pp. 584–594.
[8] C. Harris and M. Stephens, "A combined corner and edge detector," in Alvey Vision Conference, Manchester, UK, 1988.
[9] S. M. Smith and J. M. Brady, "SUSAN—A new approach to low level image processing," International Journal of Computer Vision, vol. 23, pp. 45–78, 1997.
[10] E. Rosten and T. Drummond, "Fusing points and lines for high performance tracking," in Tenth IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China, 2005, pp. 1508–1515.
[11] P. R. Beaudet, "Rotationally invariant image operators," in Proceedings of the 4th International Joint Conference on Pattern Recognition, Tokyo, Japan, 1978, pp. 579–583.
[12] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, pp. 91–110, 2004.
[13] K. Mikolajczyk and C. Schmid, "Scale & affine invariant interest point detectors," International Journal of Computer Vision, vol. 60, pp. 63–86, 2004.
[14] W. N. J. H. W. Yussof and M. S. Hitam, "Invariant Gabor-based interest points detector under geometric transformation," Digital Signal Processing, vol. 25, pp. 190–197, 2014.
[15] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, pp. 346–359, 2008.
[16] E. Rublee, V. Rabaud, K. Konolige and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in 2011 IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2564–2571.
[17] S. Leutenegger, M. Chli and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in 2011 IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2548–2555.
[18] A. Alahi, R. Ortiz and P. Vandergheynst, "FREAK: Fast retina keypoint," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 510–517.
[19] P. L. Rosin, "Measuring corner properties," Computer Vision and Image Understanding, pp. 291–307, 1999.
[20] P. Alcantarilla, A. Bartoli and A. Davison, "KAZE features," in Computer Vision – ECCV 2012, pp. 214–227, 2012.
[21] J. Weickert, S. Ishikawa and A. Imiya, "Linear scale-space has first been proposed in Japan," Journal of Mathematical Imaging and Vision, vol. 10, pp. 237–252, 1999.
[22] T. Lindeberg, "Image matching using generalized scale-space interest points," Journal of Mathematical Imaging and Vision, Springer, 2015, pp. 3–36.