Computer vision in general, and facial recognition in particular, are rapidly growing fields of computer science that can benefit, or even revolutionize, areas such as transportation and retail. In this article we’re going to look closely at how such a system could be built.

The source code for this project is available on GitHub. You can run it locally, or you can take a look at the demo app to get an idea of how it works. This service does only one thing: it processes a video stream from a webcam, detects faces, and determines whether the people it sees have been seen before. This is a generic computer vision task that many applications rely on nowadays, such as facial identification on your phone or security systems with access control.

First, let’s get the bigger picture on how this system operates.


The service consists of three distinct parts: a WebRTC server, an application server and a recognition server. Client applications both stream video data and get facial recognition results back over WebRTC. The process can be described with the following sequence diagram:

Sequence diagram

Now let’s take a closer look at each actor in the system.

Client application

The client in this example is a web application built with React, Redux and Next.js, but it is also possible to create a mobile or desktop application, since there are WebRTC implementations for most modern platforms. What’s more, with just a few small changes, you can process video from IP cameras over RTSP.

The client application also acts as a temporary store for face data received from the recognition server. In a real-world system, this task would be performed on the server side and would involve a database, but for the sake of simplicity, that is out of the scope of this article.

WebRTC signaling and streaming is performed by the OpenVidu client over WebSockets, so the WebRTC implementation is trivial.

WebRTC server

The WebRTC server consists of several building blocks. At its core, it is based on Kurento (a media server) and OpenVidu (a signaling server), both of which are open-source solutions, with a paid support option for OpenVidu that allows scaling to multiple nodes. The third part of the solution is Coturn, a TURN server used for NAT traversal. Redis and NGINX are also involved internally to handle state management and to secure communication between the system components.

Overall, this solution works out of the box and has unopinionated client libraries available, so we’re not going to dive into the details here. It’s worth mentioning that OpenVidu has recently released mediasoup support to replace Kurento, which can dramatically improve overall performance and provide low-level access for Node.js applications.

Application server

This server is responsible for the following tasks:

  • Provide an API to securely initiate WebRTC sessions. It is possible to initiate a session from the client directly, but that entails security risks and should not be done in production.
  • Connect to client video broadcasts and take regular snapshots. The server acts as a WebRTC client and saves video frames to a canvas for further processing.
  • Provide snapshots to the recognition worker and relay the results back to clients. When the worker responds with face data, the server crops the previously stored snapshot down to the regions that contain faces and sends them to the client along with the face encodings.
  • Host the client web application. This is not a must, but rather a neat feature that Next.js provides.
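To illustrate the cropping step, here is a small helper that expands a detected face bounding box by a margin and clamps it to the snapshot dimensions before cropping. The `FaceBox` shape, the `cropRegion` name, and the default margin are assumptions for this sketch, not the project’s actual API:

```typescript
// A face region as reported by the recognition worker
// (hypothetical shape for this sketch).
interface FaceBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Expand a face box by a relative margin and clamp it to the
// snapshot dimensions, so the crop never leaves the image.
function cropRegion(
  box: FaceBox,
  imageWidth: number,
  imageHeight: number,
  margin = 0.2,
): FaceBox {
  const dx = box.width * margin;
  const dy = box.height * margin;

  const x = Math.max(0, box.x - dx);
  const y = Math.max(0, box.y - dy);

  return {
    x,
    y,
    width: Math.min(imageWidth - x, box.width + 2 * dx),
    height: Math.min(imageHeight - y, box.height + 2 * dy),
  };
}
```

The resulting rectangle can then be passed to a canvas drawing call to produce the cropped image that is sent to the client.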

Recognition worker

Last but not least, the server responsible for the actual facial recognition tasks. It is a Python web server that utilizes the well-known face_recognition library to find and encode faces in images. In this example it’s going to be as simple as that: a stateless server that uses a readily available library to do all the heavy lifting. We’ll look at possible directions for improving upon this solution at the end of the article.


Ultimately, we want to be able to tell whether two photos contain the face of the same person. But what exactly are we going to compare? To perform a comparison, we need to transform a photo, which is merely an array of pixels, into something more meaningful. This process is called face encoding. It can vary depending on the approach and algorithm used, but overall you can think of it as a measurement of facial features, such as the distances between the eyes, eyebrows, nose and lips, and their size and position. The composition of these features can be represented by an N-dimensional vector (in this case, a 128-dimensional one), sometimes called a face vector. By calculating the Euclidean distance between two face vectors, we can measure how close the faces are.

```typescript
const newFace = {
  encodings: [/* face vector */],
};

const candidates = [];

// Compare the face with all existing faces
for (const existing of existingFaces) {
  // Calculate the Euclidean distance, which is a measure
  // of how similar two faces are to each other.
  // The lower the distance, the more similarity they share.
  const sum = existing.encodings.reduce((res, x1, i) => {
    const x2 = newFace.encodings[i];

    return res + ((x1 - x2) ** 2);
  }, 0);

  const distance = Math.sqrt(sum);

  // Only consider faces that are similar enough (distance < 0.6)
  if (distance >= 0 && distance < state.threshold) {
    candidates.push({
      match: existing,
      distance,
    });
  }
}

// Find the face with the shortest Euclidean distance
const candidate = candidates
  .sort((a, b) => a.distance - b.distance)[0];

// Calculate the similarity percentage
if (candidate) {
  const linear = 1.0 - (candidate.distance / (state.threshold * 2.0));
  const score = linear + ((1.0 - linear) * Math.pow((linear - 0.5) * 2, 0.2));
  const similarity = Math.round(score * 100);

  alert(`Found a face with a similarity of ${similarity}%`);
} else {
  alert(`A new face was detected`);
}
```

Going further

While this example is an MVP that doesn’t cut too many corners, it still requires some attention before it can be used in production. There are a few things to consider if you’re going to build a real-world service on top of it.

Machine learning pipeline

If you ran this service yourself, you may have noticed that image processing takes a decent amount of time. One of the reasons is the algorithm face_recognition uses for face detection. Face encoding is preceded by face detection: in order to encode facial features, you need to find them in the image first. Instead of relying on the library’s ability to both detect faces and encode facial features, you could use a faster ML model to detect faces first, then find and encode facial features in just a small region of the image. This becomes critical when processing higher-definition images and handling a lot of requests at the same time.
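The shape of such a two-stage pipeline can be sketched as follows. The `detectFaces` and `encodeFace` functions here are stub placeholders standing in for a fast detector model and a slower encoder (in the real service that work happens in the Python worker), so only the orchestration is meaningful:

```typescript
type Box = { x: number; y: number; width: number; height: number };
type Image = { width: number; height: number /* pixel data omitted */ };

// Stub for a fast, coarse face detector (hypothetical; a real
// pipeline would use a lightweight ML model here).
function detectFaces(image: Image): Box[] {
  return [{ x: 0, y: 0, width: 64, height: 64 }];
}

// Stub for the slower encoder that produces a 128-dimensional
// face vector; it runs only on a small face region, which is far
// cheaper than scanning the whole frame.
function encodeFace(image: Image, box: Box): number[] {
  return new Array(128).fill(0);
}

// Detect first, then encode only the regions that contain faces.
function processFrame(image: Image): number[][] {
  return detectFaces(image).map((box) => encodeFace(image, box));
}
```

The point of the split is that the expensive encoder never sees whole frames, only small crops produced by the cheap detector.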

Moreover, your service could also use specialized ML models to classify faces by age or gender. That opens huge possibilities, but also requires a lot of fine-tuning and performance optimizations.

Hardware acceleration

Model inference can perform much better on a GPU, but since GPU support is highly platform-dependent (especially when you put Docker into the mix), it was intentionally left disabled here. Depending on your requirements, your application could also benefit from a hardware-accelerated canvas. And even without a discrete GPU, you can still optimize performance by using the CPU instructions or CPU architecture that your ML framework is optimized for.

Persistent storage

This demo application stores face data in memory on the client. In a production application, you would likely do that on the server side, using persistent storage such as a database. The main obstacle is that there aren’t many databases that can efficiently index and query N-dimensional vectors. You could solve that by writing your own indexing algorithm or by using third-party extensions available for some popular RDBMSs.
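To make the indexing problem concrete: without vector support, every query degenerates into a linear scan over all stored face vectors, comparing each one by Euclidean distance. A minimal sketch of that baseline (the `nearest` helper is illustrative, not part of the project):

```typescript
// Naive nearest-neighbour search over stored face vectors.
// Fine for a demo, but O(n) per query - exactly the cost that a
// proper vector index is meant to avoid.
function nearest(
  query: number[],
  vectors: number[][],
): { index: number; distance: number } | undefined {
  let best: { index: number; distance: number } | undefined;

  vectors.forEach((v, index) => {
    // Euclidean distance between the query and a stored vector
    const distance = Math.sqrt(
      v.reduce((sum, x, i) => sum + (x - query[i]) ** 2, 0),
    );

    if (!best || distance < best.distance) {
      best = { index, distance };
    }
  });

  return best;
}
```

With millions of 128-dimensional vectors, this scan is what an indexing structure (or a vector-search database extension) replaces with sublinear lookups.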