Crowds present a large problem in many areas — from retail to civil services to banks. Usually, detection of people in queues is done with cameras with built-in facial recognition.
In many cases, however, the location in question already has regular security cameras installed, and the installation of new cameras would cost a lot of money.
Recognizing people in queues may look like one of the simpler tasks in computer vision, but in practice it is a nuanced problem that often requires custom software development.
This article describes how a local city hall augmented its low-resolution (640 by 480 px) security cameras with computer vision to count how long the queues are and to detect the busiest office hours.
Machine learning models for silhouette recognition
When you need to recognize people in a crowded space with low-resolution cameras using a computer vision approach, the best course of action is to use a pre-trained model.
Several available models can detect silhouettes. Faster R-CNN and YOLO v2 and v3 are among the best architectures for this task.
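Whichever pre-trained detector is used, its raw output usually has to be reduced to a people count. The snippet below is a minimal sketch of that step. It assumes a hypothetical output format (a list of dictionaries with `label`, `score`, and `box` keys) and a hypothetical confidence threshold of 0.5; real models return their own structures, and the sample detections are made up for illustration.

```python
def count_people(detections, score_threshold=0.5, person_label="person"):
    """Count detections of the person class at or above the score threshold.

    The detection format and the default threshold are assumptions made
    for this sketch, not the output of any specific model.
    """
    return sum(
        1
        for det in detections
        if det["label"] == person_label and det["score"] >= score_threshold
    )


# Made-up detector output for a single frame (not real model results).
frame_detections = [
    {"label": "person", "score": 0.92, "box": (34, 120, 88, 290)},
    {"label": "person", "score": 0.41, "box": (300, 110, 340, 250)},
    {"label": "chair", "score": 0.77, "box": (400, 300, 470, 420)},
    {"label": "person", "score": 0.66, "box": (150, 100, 210, 280)},
]

print(count_people(frame_detections))
```

Lowering the threshold trades missed people for more false positives, so the value is worth tuning on footage from the actual cameras.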
YOLO v2 recognition results:
Faster R-CNN recognition results:
YOLO vs Faster R-CNN for silhouette recognition
YOLO’s advantage is speed: the model responds faster, which matters in some tasks. In practice, however, when a pre-trained version of the model cannot be used as is and additional training on a specialized dataset is needed, Faster R-CNN turned out to be the better choice.
YOLO v2 performed poorly when the camera was installed far away (silhouette height below 100 pixels at a resolution of 1920 by 1080) or when personal protective equipment such as helmets, fasteners, and protective clothing elements also had to be recognized, that is, whenever additional training on a custom dataset was involved.
An example of using the Faster R-CNN model, retrained on a custom dataset:
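The 100-pixel figure above is tied to a 1920 by 1080 frame. One way to apply the same rule of thumb to other resolutions is to express it relative to frame height. The sketch below assumes the threshold scales linearly with frame height, which the original observation does not confirm; the function name is hypothetical.

```python
# Reference values taken from the text: 100 px silhouettes in a 1080 px tall frame.
REFERENCE_SILHOUETTE_HEIGHT_PX = 100
REFERENCE_FRAME_HEIGHT_PX = 1080


def is_small_silhouette(box_height_px, frame_height_px):
    """Return True if a silhouette is relatively smaller than the reference.

    Assumption: the threshold scales linearly with frame height.
    """
    relative_height = box_height_px / frame_height_px
    reference = REFERENCE_SILHOUETTE_HEIGHT_PX / REFERENCE_FRAME_HEIGHT_PX
    return relative_height < reference


print(is_small_silhouette(90, 1080))  # 90 px tall silhouette in a 1080p frame
print(is_small_silhouette(60, 480))   # 60 px tall silhouette in a 640x480 frame
```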
Creating your own training dataset
Pre-existing datasets often lack the images needed to train a specialized model, for example images of people dressed in specialized safety gear or heavy winter clothes.
Here’s a workflow to follow when creating a custom dataset:
- collect video clips with the required objects: customer videos and videos in the public domain (such as public footage from surveillance cameras);
- cut and filter the video fragments so that the resulting dataset is balanced across the objects to be recognized;
- mark up the dataset using a markup tool;
- selectively check the results of the markup work;
- if necessary, augment the dataset: add different head positions and reflections, and vary the footage sharpness.
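Two of the steps above, balancing and reflection-based augmentation, can be sketched in a few lines. This is only an illustration under stated assumptions: samples are hypothetical dictionaries with a `label` key, balancing is done by downsampling every class to the size of the smallest one (other strategies exist), and boxes use an assumed `(x_min, y_min, x_max, y_max)` pixel format.

```python
import random


def balance_by_class(samples, seed=0):
    """Downsample every class to the size of the smallest class."""
    by_class = {}
    for sample in samples:
        by_class.setdefault(sample["label"], []).append(sample)
    if not by_class:
        return []
    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, target))
    return balanced


def flip_box_horizontally(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box for a horizontally flipped image."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


# Made-up samples: five helmet frames and two winter hat frames.
samples = (
    [{"label": "helmet", "frame": i} for i in range(5)]
    + [{"label": "winter_hat", "frame": i} for i in range(2)]
)
balanced = balance_by_class(samples)
print(len(balanced))
print(flip_box_horizontally((100, 50, 200, 150), image_width=640))
```

When an image is mirrored, its annotations have to be mirrored with it, which is why the box transform is kept next to the balancing helper.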
Alternative approach: head detection
Choosing the right model, enlarging or changing the training sample, and other purely ML-based methods are not the only ways to achieve high-quality results. Sometimes a decisive improvement can be obtained only by changing the entire approach to the task at hand.
In real queues, people crowd and therefore overlap each other, so the quality of recognition is often too low to use only the silhouette detection method in real conditions.
Take the image below. There are 18 people in the frame, and the silhouette detection model identified 11 people:
To improve the results, the approach was changed to detect people’s heads instead of silhouettes. To do this, the Faster R-CNN model was trained on a premade dataset that includes pictures of large gatherings of people.
In addition, the dataset was enriched by about a third with frames from the customer’s video cameras, mainly because the original dataset contained few heads in winter hats.
The main problems encountered were image quality and the scale of objects. The heads vary in size (as can be seen in the image above), and the frames from the customer’s cameras had a resolution of 640x480. Because of this, random objects (hoods, Christmas ornaments, chairs) are sometimes detected as heads.
In general, however, this model copes quite well with dense crowds. In the frame above, it identified 15 of the 18 people:
Thus, in this image, the model failed to detect only three heads, which were significantly blocked by other objects anyway.
To improve the quality of the model, you can replace the current cameras with high-resolution ones and collect and mark up additional training data.
Nevertheless, keep in mind that when there are few people, silhouette detection is more suitable than head detection, since it is harder to completely block a silhouette or confuse it with other objects. In a crowd, however, silhouette detection alone is not enough, so to count people in the queue, it was decided to use both models in parallel and combine their results.
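The article does not describe how the two sets of results were merged, so the following is only one possible sketch, not the method used in the project. It assumes both models return boxes as `(x_min, y_min, x_max, y_max)`, pairs each head with at most one silhouette that contains the head’s center, and counts every silhouette plus every head left unpaired. All names and sample boxes are hypothetical.

```python
def center(box):
    """Return the center point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def contains(box, point):
    """Return True if the point lies inside the box (borders included)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def combined_count(silhouette_boxes, head_boxes):
    """Count people as all silhouettes plus heads not paired with a silhouette."""
    claimed = set()  # indexes of silhouettes already paired with a head
    unmatched_heads = 0
    for head in head_boxes:
        head_center = center(head)
        for index, silhouette in enumerate(silhouette_boxes):
            if index not in claimed and contains(silhouette, head_center):
                claimed.add(index)
                break
        else:
            # No free silhouette contains this head: likely an occluded person.
            unmatched_heads += 1
    return len(silhouette_boxes) + unmatched_heads


# Made-up boxes: three silhouettes, three heads.
silhouettes = [(0, 0, 100, 300), (200, 0, 300, 300), (400, 0, 480, 300)]
heads = [(30, 10, 70, 50), (60, 20, 90, 55), (230, 10, 270, 50)]
print(combined_count(silhouettes, heads))
```

In this example the second head falls inside a silhouette that is already paired, so it is counted as an extra person standing behind someone else, while the third silhouette is counted even though its head was not detected. The greedy pairing is a simplification; a production system would likely need a more careful matching step.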
Silhouettes and heads, an example of a recognition result:
In conclusion, it’s worth mentioning cases of very large crowds of people, for example, crowds at stadiums.
In these cases, the task is to estimate the size of the crowd: for a crowd of 300 people, an answer of 312 or 270 is considered acceptable.
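To make the tolerance in this example concrete, the two answers can be expressed as relative errors. The helper below is a small illustrative sketch; the function name is hypothetical, and the article itself does not state a formal error threshold.

```python
def relative_error(estimate, actual):
    """Return the absolute error as a fraction of the actual count."""
    return abs(estimate - actual) / actual


# The two estimates from the text, against an actual crowd of 300 people.
for estimate in (312, 270):
    print(estimate, f"{relative_error(estimate, 300):.0%}")
```

The two estimates are off by 4% and 10% respectively, which gives a rough sense of the accuracy expected from crowd-size estimation.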
In practice, there was no need to solve such problems with the help of video analytics. However, some models are capable of counting large numbers of people with high accuracy.
The following is the result of one such model:
This article is a translation. Read the original article here.