Machine learning has attracted a great deal of attention in recent years. The main reasons are advances in machine learning algorithms, new hardware with high processing power, and the growing amount of available data.
Facial recognition is one of the important subtopics of machine learning. Its main areas of use today are security, public surveillance, mobile applications and social networks such as Facebook and Snapchat, and these systems are likely to occupy an even larger place in our lives.
In this article, we will try to build a face recognition system for streaming video. Pre-processing, feature extraction and recognition require a lot of processing power even for a single video frame, and a video delivers dozens of frames every second. So, besides high accuracy, we also need to process each frame very quickly to keep up with the stream.
To recognize faces in a video, we first have to find the faces in each frame. This is where the Viola-Jones algorithm comes in. Arguably the best way to recognize a face is to locate its landmarks, the specific points that describe the face shape (the sizes of the mouth and nose, the distance between the eyes, etc.), but here we will try to do the job by extracting the HOG features of the faces. The next step is to train a classifier on these features and then classify the faces with it.
HOG (Histogram of Oriented Gradients) features are still among the most commonly used in object and face recognition systems because they are effective, simple and fast. First, I crop the faces from the full frames. Then I apply the Sobel filter to them to keep only the edges of the faces. After that, the HOG features of the ROI (Region of Interest, in this case the face) are extracted and classified with an SVM (Support Vector Machine) or any other machine learning technique. That is the basic idea. I prototyped it in MATLAB for now, but I plan to redo it in Python in the near future. The whole process:
- Detect the faces in the frames with the Viola-Jones algorithm
- Build a face database for the different classes from the detected regions (faces)
- Apply the Sobel filter to them to keep only the edges of the faces
- Extract the HOG features of every processed sample in the database
- Train a classifier on these features with the SVM
- Again, detect the faces with the Viola-Jones algorithm in every video frame
- Apply the Sobel filter to the detected faces
- Extract the HOG features of the processed faces
- Classify these features with the weight matrices obtained from the SVM in the training part. The result is the class that the detected face most probably belongs to.
Viola Jones Algorithm
The Viola-Jones algorithm detects faces very quickly. (You will see why in the next sections.) It consists of four steps: Haar feature extraction, the integral image, AdaBoost and cascading:
In the figure below, the picture on the left contains horizontal and vertical lines. If we want to keep only the horizontal lines, we can apply the following filter to the whole picture, shifting it one pixel at a time; only the horizontal components are left in the result.
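To make the filtering idea concrete, here is a minimal sketch in plain MATLAB. The figure's exact kernel is not reproduced here, so the 3×3 kernel, the toy picture and the threshold of 3 are all my own assumptions, chosen only to show the effect.

```matlab
% Toy picture: one horizontal and one vertical bright line on a dark background.
img = zeros(9, 9);
img(3, :) = 1;   % horizontal line
img(:, 6) = 1;   % vertical line

% Assumed kernel: responds to brightness changes from one row to the next
% and cancels out along a vertical line.
kernel = [-1 -1 -1;
           2  2  2;
          -1 -1 -1];

% Slide the kernel over the whole picture, one pixel at a time.
response = conv2(img, kernel, 'same');

% Keep only the strong positive responses (assumed threshold).
horizontalOnly = response > 3;
disp(horizontalOnly);
```

Only the third row survives in `horizontalOnly`; the vertical line produces responses that stay below the threshold.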
Haar features follow the same logic. The Viola-Jones algorithm works by dividing the picture into windows of 24×24 pixels. From each window we generate all kinds of features with kernels similar to the ones below; together, these kernel types are meant to bring out all the features of the window. The feature vector is created by applying every scale and location combination of these filter structures to the window, as shown in the example below, and combining the results.
The 2 sample features below provide the kind of information that lets a face be identified as a face. As can be seen on the right side of the figure, the first one measures the difference between the area where the eyes are located and the upper cheeks. The pixels under the black part of the kernel (the eye area) are multiplied by +1 and those under the white part (the lower side) by -1; summing these values gives the difference. The difference is large because the eye area is darker than the upper cheeks. Likewise, the type-3 kernel allows the bridge of the nose to be detected.
Continuing in this way, all the features of a window can be extracted. Even for a single 24×24 window, however, the feature vector has a length of 160000+, and there are many windows in a frame and many frames in a video. Classifying (detecting faces) with all of these features is computationally very expensive and nearly impossible for a video stream. So the algorithm includes several acceleration methods: the integral image, AdaBoost and cascading.
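To get a feeling for where a number like 160000+ comes from, the sketch below counts every position and scale of a few Haar-like shapes inside one 24×24 window. The set of five base shapes is an assumption on my side; the exact total depends on which shapes are included.

```matlab
% Count all Haar-like features that fit into one window.
windowSize = 24;

% Each row is the [width height] of the smallest version of one shape
% (assumed feature set, not taken from the article's figure).
baseShapes = [2 1;   % two rectangles side by side
              1 2;   % two rectangles stacked
              3 1;   % three rectangles side by side
              1 3;   % three rectangles stacked
              2 2];  % four rectangles in a checkerboard

featureCount = 0;
for s = 1:size(baseShapes, 1)
    baseW = baseShapes(s, 1);
    baseH = baseShapes(s, 2);
    % Scale the shape in whole multiples of its base size.
    for w = baseW:baseW:windowSize
        for h = baseH:baseH:windowSize
            % Number of top-left corners where a w-by-h shape still fits.
            featureCount = featureCount + (windowSize - w + 1) * (windowSize - h + 1);
        end
    end
end
fprintf('Haar-like features in a %dx%d window: %d\n', windowSize, windowSize, featureCount);
```

With these five shapes the count comes out at 162336, which is in line with the 160000+ mentioned above.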
The purpose of the integral image is to provide an easy and fast way of summing the pixels under the filter area each time we apply a filter to the image window. Instead of summing all the pixels under the filter, we read the value from the 4 vertices of that area in the integral image. The integral image of a window is calculated as shown below: the value at each pixel is the sum of that pixel and all the pixels to the left of and above it. The algorithm calculates every value of the integral image this way. Once we have the integral image of a window, we can obtain the sum of any region easily from its vertices. You can see the calculation for area D below.
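The four-vertex trick can be sketched in a few lines of plain MATLAB. The 6×6 example window, the helper name `rectSum` and the extra row and column of zeros (which avoids special cases at the border) are my own choices for illustration.

```matlab
% Small example window with known pixel values.
window = magic(6);

% Integral image: each entry is the sum of all pixels above and to the
% left of it (inclusive). It is padded with a leading row and column of zeros.
ii = zeros(size(window) + 1);
ii(2:end, 2:end) = cumsum(cumsum(window, 1), 2);

% Sum of the rectangle covering rows r1..r2 and columns c1..c2,
% read from only four entries of the integral image.
rectSum = @(r1, c1, r2, c2) ii(r2 + 1, c2 + 1) - ii(r1, c2 + 1) ...
                          - ii(r2 + 1, c1) + ii(r1, c1);

fastSum = rectSum(2, 3, 4, 5);
directSum = sum(sum(window(2:4, 3:5)));   % same value, computed the slow way

% A two-rectangle Haar-like feature is then just two such sums subtracted:
% upper half of the window (+1) minus lower half (-1).
haarValue = rectSum(1, 1, 3, 6) - rectSum(4, 1, 6, 6);
```

However large the rectangle is, `rectSum` always needs exactly four lookups, which is what makes evaluating thousands of Haar features per window affordable.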
With the AdaBoost step, we can reduce the Haar feature vector from 160000+ to around 2500 per window by eliminating the irrelevant features. Applying the type-3 Haar filter to the nose area is meaningful, but applying it to the upper lip area tells us nothing useful for a face classifier; that feature is irrelevant and unnecessary. So AdaBoost eliminates the irrelevant features and builds a strong classifier by combining the relevant features (weak classifiers) with their weights. We can think of a weight as a measure of how much that feature counts in the final decision.
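As a rough sketch of how the weighted combination works, the snippet below evaluates a strong classifier as a weighted majority vote of three weak classifiers. All numbers (feature values, thresholds, polarities and weights) are hypothetical; in practice they would come out of the boosting procedure.

```matlab
% Hypothetical values of three selected Haar-like features for one window.
featureValues = [12.0, 3.5, 40.0];

% Hypothetical parameters learned for each weak classifier.
thresholds = [10.0, 0.0, 55.0];
polarity   = [-1, 1, 1];        % direction of the inequality
alpha      = [0.9, 0.4, 0.7];   % weight of each weak classifier

% Each weak classifier looks at a single feature and votes 1 ("face-like") or 0.
weakVotes = (polarity .* featureValues) < (polarity .* thresholds);

% The strong classifier accepts the window if the weighted votes reach
% at least half of the total weight.
score = sum(alpha .* weakVotes);
isFace = score >= 0.5 * sum(alpha);
```

Here the second weak classifier votes against the window, but the other two carry enough weight for the window to be accepted.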
Although we have reduced the features to those relevant for face detection, classifying every 24×24 window with a 2500-length feature vector still requires a lot of processing power. Therefore, in the last step, the features are distributed over stages by the cascading method; each sub-window is evaluated stage by stage and classified as “face” or “not face”. Suppose each stage holds 10 features. If an incoming sub-window cannot pass the first 10 features, it is classified as “not face” immediately. Otherwise, it passes on to the next stages to be compared with the remaining features. Eliminating non-face windows at the very beginning gives a significant performance gain.
You can read this paper if you want much more detailed information about the Viola-Jones algorithm.
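The early-rejection idea can be sketched like this. The three stages, their weights and thresholds, and the votes of the sub-window are all hypothetical; the point is only that the loop stops at the first failed stage.

```matlab
% Hypothetical cascade: each stage holds a few weak-classifier weights and
% the minimum score a window needs in order to move on to the next stage.
stageAlphas = {[0.8 0.6], [0.5 0.9 0.4], [0.7 0.7 0.6 0.5]};
stageThresholds = [0.7, 0.9, 1.25];

% Hypothetical weak-classifier votes (1 or 0) of one sub-window, per stage.
windowVotes = {[1 1], [1 0 0], [1 1 1 1]};

isFace = true;
stagesEvaluated = 0;
for s = 1:numel(stageAlphas)
    stagesEvaluated = s;
    stageScore = sum(stageAlphas{s} .* windowVotes{s});
    if stageScore < stageThresholds(s)
        % Early rejection: the later, more expensive stages are never computed.
        isFace = false;
        break;
    end
end
```

This sub-window passes the first stage, fails the second and is rejected without the third stage ever being evaluated.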
If we visualize the whole algorithm process, it will look like this:
In MATLAB, you do not have to train a classifier with thousands of face and non-face samples, because a model already trained with the Viola-Jones algorithm is available. We are so lucky 🙂 This line is enough to create the detector:
faceDetector = vision.CascadeObjectDetector();
And use it with your webcam. It will detect your face and track it:
% Ignore faces smaller than 175x175 pixels. Skipping the small
% candidates improves performance.
faceDetector.MinSize = [175, 175];
% Create the point tracker object.
pointTracker = vision.PointTracker('MaxBidirectionalError', 2);
% Create the webcam object.
cam = webcam();
cam.Resolution = '640x360';
% Capture one frame to get its size.
videoFrame = snapshot(cam);
frameSize = size(videoFrame);
% Create the video player object.
videoPlayer = vision.VideoPlayer('Position', [100 100 [frameSize(2), frameSize(1)]+30]);
runLoop = true;
numPts = 0;
while runLoop
    % Get the next frame.
    videoFrame = snapshot(cam);
    videoFrameGray = rgb2gray(videoFrame);
    if numPts < 50
        % Detection mode.
        bbox = faceDetector.step(videoFrameGray);
        if ~isempty(bbox)
            % Find corner points inside the detected region.
            points = detectMinEigenFeatures(videoFrameGray, 'ROI', bbox(1, :));
            % Re-initialize the point tracker.
            xyPoints = points.Location;
            numPts = size(xyPoints, 1);
            release(pointTracker);
            initialize(pointTracker, xyPoints, videoFrameGray);
            % Save a copy of the points.
            oldPoints = xyPoints;
            % Convert the rectangle represented as [x, y, w, h] into an
            % M-by-2 matrix of [x,y] coordinates of the four corners. This
            % is needed to be able to transform the bounding box to display
            % the orientation of the face.
            bboxPoints = bbox2points(bbox(1, :));
            % Convert the box corners into the [x1 y1 x2 y2 x3 y3 x4 y4]
            % format required by insertShape.
            bboxPolygon = reshape(bboxPoints', 1, []);
            % Display a bounding box around the detected face.
            videoFrame = insertShape(videoFrame, 'Polygon', bboxPolygon, 'LineWidth', 3);
            % Display detected corners.
            videoFrame = insertMarker(videoFrame, xyPoints, '+', 'Color', 'white');
        end
    else
        % Tracking mode.
        [xyPoints, isFound] = step(pointTracker, videoFrameGray);
        visiblePoints = xyPoints(isFound, :);
        oldInliers = oldPoints(isFound, :);
        numPts = size(visiblePoints, 1);
        if numPts >= 50
            % Estimate the geometric transformation between the old points
            % and the new points.
            [xform, oldInliers, visiblePoints] = estimateGeometricTransform(...
                oldInliers, visiblePoints, 'similarity', 'MaxDistance', 4);
            % Apply the transformation to the bounding box.
            bboxPoints = transformPointsForward(xform, bboxPoints);
            % Convert the box corners into the [x1 y1 x2 y2 x3 y3 x4 y4]
            % format required by insertShape.
            bboxPolygon = reshape(bboxPoints', 1, []);
            % Display a bounding box around the face being tracked.
            videoFrame = insertShape(videoFrame, 'Polygon', bboxPolygon, 'LineWidth', 3);
            % Display tracked points.
            videoFrame = insertMarker(videoFrame, visiblePoints, '+', 'Color', 'white');
            % Reset the points.
            oldPoints = visiblePoints;
            setPoints(pointTracker, oldPoints);
        end
    end
    % Display the annotated video frame using the video player object.
    step(videoPlayer, videoFrame);
    % Check whether the video player window has been closed.
    runLoop = isOpen(videoPlayer);
end
% Clean up.
clear cam;
release(videoPlayer);
release(pointTracker);
release(faceDetector);
So yes! We have an awesome classifier like the hotdog and not hotdog classifier of Jian Yang with this code. Face and not face 🙂
OK, we can detect the faces in a streaming video with this classifier, but I am also aiming to recognize them. So I need to create a database with the faces of the people. (I have 4 people for now, with a thousand samples each; one of them is me.) With the help of this face detector, I collected the face samples by standing at different angles and making different facial expressions. This is important, because we need as many samples of facial expressions under different lighting as possible for a better training result. This way, we can collect one thousand samples per person in about a minute. Now we should apply the edge filter, extract the HOG features of the samples and train the SVM classifier on these features.
HOG(Histogram of Oriented Gradients) Features:
We can say that HOG features are a compressed and encoded version of the picture, built from cell and block structures. They tell us how strongly, and in which direction, the brightness changes around each pixel. This lets us observe local changes in shape.
The HOG feature vector is generated by calculating the gradient at each pixel, building a histogram of these gradients for each cell, normalizing the histograms within each block, and concatenating the normalized vectors of all blocks.
For each pixel at coordinate (x, y), the magnitude \(m(x, y)\) and the direction \(θ(x, y)\) must be calculated. Denoting each pixel's brightness value by \(f(x, y)\), the gradients along the x and y axes are typically computed as central differences:
\(f_x(x,y)=f(x+1,y)-f(x-1,y)\)
\(f_y(x,y)=f(x,y+1)-f(x,y-1)\)
so the magnitude will be => \(m(x,y)=\sqrt{f_x(x,y)^2+f_y(x,y)^2}\)
and the angle => \(θ(x,y)=\arctan\left(\frac{f_y(x,y)}{f_x(x,y)}\right)\)
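As a small worked example, the sketch below computes the magnitude and angle for the centre pixel of a 3×3 patch. The patch values are made up, the gradients use central differences (an assumption about the exact filter), and x is taken along the columns and y along the rows.

```matlab
% Made-up 3x3 brightness patch; only the four neighbours of the centre matter.
f = [ 0  5  0;
     10  0 40;
      0 45  0];

% Central differences on the interior pixels.
fx = f(2:end-1, 3:end) - f(2:end-1, 1:end-2);   % right minus left: 40 - 10
fy = f(3:end, 2:end-1) - f(1:end-2, 2:end-1);   % below minus above: 45 - 5

% Gradient magnitude and unsigned direction in degrees (0 to 180).
m = sqrt(fx.^2 + fy.^2);
theta = mod(atan2(fy, fx) * 180 / pi, 180);
```

For this patch the gradient is (30, 40), so the magnitude is 50 and the angle is roughly 53 degrees. The same lines work unchanged on a full grayscale image converted to double.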
We can calculate the gradients like this, but these values have no meaning on their own. We have to see the effect of the magnitude and direction of all pixels in a histogram. This is done in the gradient vote step.
Gradient Vote Calculation:
Here we decide which bins of the histogram are affected by each pixel's gradient, and by how much, according to its magnitude and direction. In MATLAB, a histogram has 9 bins by default for directions between 0 and 180 degrees (20 degrees per bin). Each pixel's magnitude is distributed between the two bin centers nearest to its direction, in proportion to how close it is to each. For example, for a pixel with a 45 degree angle, ¼ of its magnitude is added to the 30 degree bin and ¾ of it to the 50 degree bin.
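After the weight (α) of a pixel is calculated, its vote is cast like this:
\(α=\left|(n+0.5)-\frac{b \cdot θ(x, y)}{π}\right|\)
\(m_n=(1-α) \cdot m(x, y)\)
where n is the number of the bin that the pixel belongs to (counted from 0), b is the total number of bins (9) and θ is in radians. The remaining share, \(α \cdot m(x, y)\), goes to the nearest neighboring bin.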
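The 45 degree example can be reproduced with a few lines of plain MATLAB. This is only a sketch of the voting idea: it measures the share from the bin center below the angle, which is an equivalent way of doing the bookkeeping, and the variable names are my own.

```matlab
numBins = 9;                  % b in the formula above
binWidth = 180 / numBins;     % 20 degrees per bin
binVotes = zeros(1, numBins); % bin centers at 10, 30, 50, ..., 170 degrees

% One example pixel: gradient magnitude 1 at 45 degrees.
magnitude = 1;
thetaDeg = 45;

% Position of the angle in bin widths, measured from the first bin center.
pos = thetaDeg / binWidth - 0.5;
lowerBin = floor(pos);          % zero-based index of the bin center below the angle
upperShare = pos - lowerBin;    % fraction that goes to the bin center above

% Wrap around so that angles near 0 or 180 degrees are shared
% between the first and the last bin.
lowerIdx = mod(lowerBin, numBins) + 1;
upperIdx = mod(lowerBin + 1, numBins) + 1;

binVotes(lowerIdx) = binVotes(lowerIdx) + (1 - upperShare) * magnitude;
binVotes(upperIdx) = binVotes(upperIdx) + upperShare * magnitude;
```

The 30 degree bin (index 2) receives ¼ of the magnitude and the 50 degree bin (index 3) receives ¾, matching the example above. Looping this over all pixels of a cell gives that cell's histogram.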
In the last step, the histograms of each block are combined and normalized. By default in MATLAB this vector has a length of 36, because each block has 4 cells and each cell's histogram has 9 bins:
\(V_i^N= \frac{V_i}{\sqrt{||V||_2^2+ ε^2}}\)
\(||V||_2^2=V_1^2+ V_2^2+…+ V_{36}^2\)
where V is the non-normalized vector containing all the histograms in a block and ε is a small constant that prevents division by zero.
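In code, the normalization of one block is a one-liner. The histogram values and the size of ε below are made up for illustration.

```matlab
% Hypothetical block: 4 cells x 9 bins = 36 histogram values.
V = repmat([0 1 2 3 4 3 2 1 0], 1, 4);
epsilon = 1e-3;   % small constant; the exact value is an assumption

% L2 normalization as in the formula above.
Vn = V / sqrt(sum(V .^ 2) + epsilon ^ 2);
```

After this step the block vector has (almost exactly) unit length, so a brighter or darker copy of the same face produces nearly the same features.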
By combining the normalized vectors of all blocks, we obtain the HOG feature vector. For a 240×240 picture, the HOG feature vector has a length of 30276. You can calculate it for your face samples in MATLAB like this:
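% Default parameters of extractHOGFeatures; I is a 240x240 grayscale image.
CellSize = [8 8]; BlockSize = [2 2]; NumberOfBins = 9;
BlockOverlap = ceil(BlockSize/2);
BlocksPerImage = floor((size(I)./CellSize - BlockSize)./(BlockSize - BlockOverlap) + 1);
N = prod([BlocksPerImage, BlockSize, NumberOfBins])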
Visualization of a HOG feature vector of the processed face :
% Edge detection of my face from the training database.
% Note: normalize8 is not a built-in MATLAB function; it is assumed to be a
% helper on the path that rescales the image to the 8-bit range.
grayFace = normalize8(rgb2gray(read(trainingImages(1), 3)));
[~, threshold] = edge(grayFace, 'sobel');
fudgeFactor = .7;
BW1 = edge(grayFace, 'sobel', threshold * fudgeFactor);
% Extract the HOG features of the processed face.
[featureVector, hogVisualization] = extractHOGFeatures(BW1, 'CellSize', [8 8]);
% Visualize them.
figure;
subplot(1,3,1); imshow(rgb2gray(read(trainingImages(1), 3)));
subplot(1,3,2); imshow(BW1);
subplot(1,3,3); imshow(BW1); hold on; plot(hogVisualization);
You can see the result is so nice 😀
You should apply the Sobel edge filter to your whole face database and extract the HOG features of the results for the training process.
Training with Support Vector Machine :
SVM classifiers work best when the classes are linearly separable. The working logic is very simple: the SVM tries to find the best hyperplane that separates the classes. Such a hyperplane should lie between the classes, as you can see in the figure below, so that it fully separates the two classes. (y is the class/label value, w is the weight vector, x is the HOG feature vector of a sample and b is the bias.) The class value is 0 on the hyperplane (the orange one) that separates the two classes. If the value of the samples above this hyperplane is at least 1 and the value of those below is at most -1, then the distance between the two closest samples of the two classes is calculated as follows:
Our hyperplane equation: \(y = w^T x + b\)
if \(y ≥ 1 \Rightarrow x ∈ \text{class 1}\)
if \(y ≤ -1 \Rightarrow x ∈ \text{class 2}\)
\(\frac{w^T}{||w||} (x_1-x_2)= \frac{2}{||w||}\)
We can see from the figure and the equations that the SVM looks for the hyperplane with the greatest margin. To achieve this, the length of the weight vector, \(||w||\), must be made as small as possible. So the SVM actually tries to minimize the norm of the feature weights.
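To see the relation between the weight vector and the margin in numbers, here is a tiny sketch with a hypothetical, already trained linear SVM on two-dimensional features. The values of `w`, `b` and the samples are made up.

```matlab
% Hypothetical trained linear SVM.
w = [3; 4];
b = -5;

% A few samples, one per row.
X = [3 1;    % w'*x + b =  8
     1 0;    % w'*x + b = -2
     0 2];   % w'*x + b =  3

% Decision value and predicted label (+1 or -1) of each sample.
decisionValues = X * w + b;
labels = sign(decisionValues);

% Width of the margin between the hyperplanes y = +1 and y = -1.
marginWidth = 2 / norm(w);
```

Here the norm of `w` is 5, so the margin is 0.4 wide; halving the weights would double the margin, which is why training pushes the norm down.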
The figure shows the SVM separating only two classes, but with a method called one-vs-all, all of our classes can be separated, and we can easily do all of this in MATLAB with a few lines of code.
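The one-vs-all idea itself is simple enough to sketch without any toolbox: train one binary classifier per person against everyone else, then pick the class whose classifier gives the highest decision value. The weights, biases and the feature vector below are hypothetical and much shorter than a real HOG vector.

```matlab
% Hypothetical one-vs-all model for 4 people and 3-dimensional features:
% row k of W and entry k of b form the "person k vs. everyone else" classifier.
W = [ 1.0 -0.5  0.0;
     -0.5  1.0  0.0;
      0.0  0.5  1.0;
     -1.0 -1.0 -1.0];
b = [0; 0; -0.5; 0.2];

% Feature vector of one detected face (made up).
x = [0.2; 0.9; 0.1];

% One decision value per class; the largest one wins.
scores = W * x + b;
[bestScore, predictedClass] = max(scores);
```

For this sample the second classifier gives the highest value, so the face is assigned to person 2.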
It is time for some magic :p