Best Open-Source Tools for AI Video Enhancement

20 Oct 2023 Roman Babchenko

In the ever-evolving landscape of digital content, achieving impeccable video quality has become more than just a desire; it’s a necessity. As you delve into the world of AI video enhancement, you might find yourself amidst a sea of options. Navigating this space, especially if you’re embarking on app development or seeking to elevate your existing content, can be daunting. Fear not. This blog aims to be your compass, guiding you through the intricacies of open-source video enhancement tools. Together, we’ll explore the innovations steering this revolution, shedding light on tools that promise not just enhancement, but transformation.

Overview

In this blog post, we explore techniques and approaches to enhance video image quality using AI. We’ve conducted a high-level examination of the leading open-source solutions renowned for their effectiveness and popularity, avoiding diving too deep into the technical and mathematical intricacies of each. We also present an overview of their performance on various video files, underscoring the strengths and limitations of each method.
Our aim is to delve into the methods and techniques that can heighten the clarity and visual appeal of video content. Furthermore, we seek a solution that could lay the groundwork for future projects in this domain.
In conclusion, we provide guidance on effectively using and configuring these tools, ensuring you achieve the best outcomes in video image enhancement.

Task

We are working with two low-resolution video files. The first, despite its relatively high resolution, suffers from a lack of detail and sharpness; the second is well detailed but small. Our aim is to enhance the quality and resolution of both using state-of-the-art, open-source algorithms.

1. Blurred (undetailed), 1000 x 700 resolution at 30 fps:

Blurred Video Sample Before Enhancement

The visual details of this file don’t match its high resolution. A significant number of objects within the video appear blurred and aren’t visualized properly. Our immediate task is to enhance the clarity of this image.

2. Sharp (detailed), 640 x 360 low resolution at 30 fps:

Original sample of sharp video before resolution upscaling

Although this sample has a lower resolution, the objects within it are clearly visualized. The objective here is to upscale the video resolution while maintaining the existing detail level.

Available Solutions Overview

The market offers a plethora of algorithms and models aimed at improving image quality. Out of the myriad options, we have chosen to spotlight those that have demonstrated significant efficacy on specific video samples. Our focus remains on open-source solutions, excluding commercial and proprietary products.
It’s pertinent to mention that certain proprietary solutions, given their specialized AI training, are deemed industry front-runners, making them tough competitors.

From the vast pool of available techniques and models, we’ve zeroed in on the following:

  1. Stable Diffusion Art upscaler
  2. ESRGAN (Enhanced Super-Resolution Generative Adversarial Network)
  3. DAIN (Depth-Aware Video Frame Interpolation)

Stable Diffusion Upscaler

The Stable Diffusion Upscaler is a machine learning model designed for image super-resolution, a task that involves increasing the resolution and quality of low-resolution images. This model represents an advanced and state-of-the-art approach, leveraging the concept of stable diffusion, a probabilistic model, to achieve high-quality upscaling.
The core of the Stable Diffusion Upscaler lies in the diffusion model. Diffusion models, a class of generative models, are trained by gradually adding noise to images and learning to reverse that process. These denoising principles underpin a wide range of image generation and enhancement tasks.
Moreover, the Stable Diffusion Upscaler integrates the diffusion model with GANs (Generative Adversarial Networks).

Pros:
– High efficiency with blurred and low-detail images.
– Capability to use a text prompt for additional detailing.
– Provides consistent image processing.

Cons:
– Unstable result generation, especially at higher noise levels.
– Has high hardware requirements.
– Accepts only low input resolutions, yet still consumes significant resources.
– Lacks tile support.

ESRGAN Model Overview

ESRGAN, standing for Enhanced Super-Resolution Generative Adversarial Network, is an advanced deep learning model crafted for image super-resolution. This model builds upon the original SRGAN (Super-Resolution Generative Adversarial Network) and has garnered significant attention in the computer vision and image processing arenas due to its prowess in generating high-quality, high-resolution images from lower-resolution inputs.

For our purposes, we’ll employ a modified and enhanced version of ESRGAN named Real-ESRGAN, which, according to its creators, delivers superior results.

Pros:
– Delivers stable and coherent results during video frame processing.
– Highly efficient with clear, detailed objects.
– Offers speed in processing.
– Apt for handling large images and supports image tiling.

Cons:
– Shows limited efficiency in restoring heavily blurred frame details.
– Uneven image processing observed with high input resolutions.
– Potential mismatch in predicted texture scales.

DAIN (Depth-Aware Video Frame Interpolation)

Unlike models focused on enhancing video resolution and quality, DAIN centers on frame rate upscaling through interpolation. This is pivotal for ensuring smooth motion and refining transitions between frames.

Example of frame interpolation for smooth motion and refined transitions between frames

Video frame interpolation is the art of synthesizing non-existent frames between the original ones. Despite the advancements credited to deep convolutional neural networks, interpolation quality often diminishes due to factors like substantial object motion or occlusion. To address this, the proposal here is to detect occlusion explicitly by harnessing the depth cue in frame interpolation.
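To make the idea concrete, here is a deliberately naive sketch that synthesizes an in-between frame by simply cross-fading two neighboring frames with OpenCV (the frame file names are illustrative). DAIN replaces this blending with depth- and motion-aware warping, which avoids the ghosting such a cross-fade produces on moving or occluded objects.

import cv2

# Naive baseline: create an intermediate frame as a 50/50 blend of its neighbors.
frame_a = cv2.imread("frames/frame_00010.png")   # illustrative paths
frame_b = cv2.imread("frames/frame_00011.png")

middle = cv2.addWeighted(frame_a, 0.5, frame_b, 0.5, 0)
cv2.imwrite("frames/frame_00010_5.png", middle)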

Solution Approach

The techniques and models discussed for enhancing image quality each come with their own set of strengths and weaknesses. Currently, there isn’t a one-size-fits-all solution suitable for every video or image. As such, our approach involves testing all the methods described on the video clips with the most challenging initial conditions. To accomplish this, we’ve set up a processing pipeline where the raw video is the input, and the output is an enhanced video using AI.

Processing Pipeline

The methodology for the video image processing pipeline remains consistent across all discussed techniques. The only variations arise based on the specific models chosen, and the associated parameters and pre-processing steps essential for their optimal functioning.
Overview of video processing pipeline from file decomposition to frame interpolation

1. Video File Decomposition
The input video file is segmented into individual frames for more detailed processing. This segmentation is executed using the OpenCV library. For more details, please refer to the OpenCV VideoCapture Class Reference and the associated documentation.
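As a point of reference, a minimal frame-extraction sketch with OpenCV’s VideoCapture might look like this (the input path and output directory are illustrative, and the frames/ directory is assumed to exist):

import cv2

cap = cv2.VideoCapture("input.mp4")     # source video
fps = cap.get(cv2.CAP_PROP_FPS)         # keep the original frame rate for later reassembly

frame_idx = 0
while True:
    ok, frame = cap.read()              # ok becomes False at the end of the video
    if not ok:
        break
    cv2.imwrite(f"frames/frame_{frame_idx:05d}.png", frame)
    frame_idx += 1

cap.release()
print(f"Extracted {frame_idx} frames at {fps:.2f} fps")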

2. Pre-processing
Every AI model, especially those in the super-resolution category, has unique characteristics and constraints. Hence, prior to initiating the inference process, it’s imperative to condition the images to align with the model’s requirements and ensure optimal outcomes. In this context, our strategy involves calibrating the frame’s downscaling and applying a smoothing algorithm.
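For illustration, a minimal pre-processing sketch could downscale each frame to the model’s working resolution and apply light smoothing with OpenCV; the 350 x 180 target matches the Stable Diffusion experiment described below, while the blur kernel is an illustrative, untuned value.

import cv2

frame = cv2.imread("frames/frame_00000.png")

target_w, target_h = 350, 180                          # working resolution for the model
small = cv2.resize(frame, (target_w, target_h), interpolation=cv2.INTER_AREA)
smoothed = cv2.GaussianBlur(small, (3, 3), 0)          # mild smoothing before inference

cv2.imwrite("preprocessed/frame_00000.png", smoothed)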

3. Prediction (AI Processing)
This phase is dedicated to image generation, adhering to the specific parameters set for the chosen model.

4. Frame Interpolation
This phase is invoked when there’s a need to augment the frame rate or when selecting the best frames from a larger set, stitching them seamlessly to achieve the desired frame rate.
To gauge the utility of the models, we’ll test them on our video samples, assessing the efficacy of each in terms of their suitability for the task at hand.
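Before moving on to the individual models, here is a minimal sketch of the final reassembly step: writing the processed (and, if needed, interpolated) frames back into a video at the target frame rate with OpenCV’s VideoWriter. The file names, codec, and 60 fps target are illustrative.

import cv2
import glob

frames = sorted(glob.glob("enhanced/*.png"))
first = cv2.imread(frames[0])
height, width = first.shape[:2]

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("enhanced.mp4", fourcc, 60.0, (width, height))

for path in frames:
    writer.write(cv2.imread(path))
writer.release()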

Stable Diffusion

We will employ the stabilityai/stable-diffusion-x4-upscaler model as an inference endpoint on AWS SageMaker for the Prediction step in our pipeline.
Workflow of Stable Diffusion Upscaler for video resolution enhancement

The video file is decomposed into individual frames and resized to a resolution that the Stable Diffusion 4x upscale model can process. In this instance, the resolution is 350 x 180 pixels. This limitation is due to the hardware constraints of the specific SageMaker Studio instance used for this model (ml.g5.12xlarge).

Subsequently, these frames are transmitted to the inference endpoint where the Stable Diffusion model is deployed. The processing adheres to the parameters chosen for this task.

prompt = "Trees, grass and a dog. High resolution."
noise_level = 0

Along with the textual input, the model receives a noise_level parameter, which can introduce noise to the low-resolution input.
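For readers who want to reproduce this step outside SageMaker, here is a minimal local sketch using the Hugging Face diffusers library with the same model, prompt, and noise_level; the file paths are illustrative, and the deployment details naturally differ from our endpoint setup.

import torch
from diffusers import StableDiffusionUpscalePipeline
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("preprocessed/frame_00000.png").convert("RGB")  # e.g. 350 x 180

result = pipe(
    prompt="Trees, grass and a dog. High resolution.",
    image=low_res,
    noise_level=0,   # keep added noise minimal to limit random artifacts
).images[0]
result.save("upscaled/frame_00000.png")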

It’s crucial to highlight that, by default, the Stable Diffusion upscaler introduces noise to the original image to amplify the details in the resultant image. This can result in significant random distortions, or even introduce content unrelated to the original. The following image distinctly illustrates the effects of excessive noise input, where objects can be so distorted they become unidentifiable.

Demonstration of excessive noise effect on image quality

Here’s another demonstration of the noise parameter’s influence on artifacts: each subsequent frame produces distinct image details after all denoising iterations:

Demonstration of noise parameter influence on image details

Consequently, we set the noise level at its lowest. This minimized the generation of random content from one frame to the next, striking a balance between image quality and artifact presence.
Calling this model’s function “upscaling” is something of a misnomer, since it cannot accept the original image size as input. It is more aptly described as a detail restoration process: the model introduces details where they are absent due to poor quality and also heightens sharpness.
After processing the downsized frames, we retrieve the original resolution at the output, in this case 1000 x 720, albeit with more distinct details. Once all frames have been adjusted, we begin the interpolation process, using DAIN in our scenario, and finally assemble the result into a video with an increased FPS. The end product is a video with enhanced image sharpness and a higher frame rate.

Result
Watch the transformation: original vs. enhanced video side by side.


Video sample enhanced using Stable Diffusion Upscaler

Examining closely:

Close-up of video sample enhanced using Stable Diffusion Upscaler

Behold the enhanced video quality achieved with Stable Diffusion.
While the outcomes might not be called striking, the improvements are still clearly visible. A prominent limitation of this model is the emergence of random artifacts in the produced video. This is inherent to stable diffusion, which inevitably introduces random elements into each frame. Since these artifacts can’t be entirely removed with current post-processing techniques, we limited their presence at the image generation stage instead.

ESRGAN

Let’s explore the capabilities of the ESRGAN model, specifically its enhanced version, real-ESRGAN. For this model, we employed the same pipeline as previously described, with the only difference being the model used during the prediction stage.

ESRGAN model prediction stage in video processing pipeline

Using the default parameters, we calibrated the input resolution for the model. We observed that the original video resolution of 1000 x 720 caused the image to be split into fragments, each processed individually by the model, which led to an unsatisfactory output. The fragments were of poor quality, the model couldn’t process them uniformly, and the result contained sporadic sharp areas that made the image incoherent. The best results were achieved when the width was reduced from 1000 to 256 pixels, maintaining the original aspect ratio and applying antialiasing:

from PIL import Image

basewidth = 256                                      # target width for the model input
# img is a PIL Image holding a single extracted video frame
wpercent = basewidth / float(img.size[0])            # scale factor relative to the original width
hsize = int(float(img.size[1]) * wpercent)           # new height, preserving the aspect ratio
img = img.resize((basewidth, hsize), Image.LANCZOS)  # antialiased (Lanczos) downscale
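
After the resize, the prediction step itself can be run with the Real-ESRGAN Python API. The sketch below follows the usage shown in the official repository; argument names may vary between versions, and the weights path is illustrative.

import cv2
from basicsr.archs.rrdbnet_arch import RRDBNet
from realesrgan import RealESRGANer

model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64,
                num_block=23, num_grow_ch=32, scale=4)
upsampler = RealESRGANer(
    scale=4,
    model_path="weights/RealESRGAN_x4plus.pth",  # illustrative weights path
    model=model,
    tile=256,    # tiling keeps memory usage bounded on large frames
    half=True,   # fp16 inference on GPU
)

frame = cv2.imread("preprocessed/frame_00000.png")
output, _ = upsampler.enhance(frame, outscale=4)
cv2.imwrite("enhanced/frame_00000.png", output)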

Result
Watch the transformation: original vs. enhanced video side by side.


Video sample enhanced using real-ESRGAN model

For a detailed comparison:
Detailed comparison of video sample before and after enhancement using real-ESRGAN

Enhanced video sample showing improved sharpness of elements

The elements such as grass, tree leaves, and background objects appear sharper. This result is considerably better than the previous attempts.
Comparison of video enhancement results using real-ESRGAN

Behold the enhanced video quality achieved with real-ESRGAN.

Given that this video sample already exhibits ample detail at a resolution of 640×360, we refrained from reducing it. Instead, we aimed to upscale it using real-ESRGAN.


Video sample upscaled using real-ESRGAN model

Comparison of video samples before and after enhancement and upscaling using real-ESRGAN

As evident, the real-ESRGAN model adeptly processed the low-resolution video, enhancing the details. It not only heightened the detail level but also upscaled the resolution from 640×360 to a full HD 1920×1080. Both outcomes using real-ESRGAN are superior to the results obtained from the Stable Diffusion on the same video samples. The frames are consistent, devoid of artifacts, and the video sequence remains stable.

Next Steps

We hope you could see tangible improvements in these sample videos. To continue this research and make the results even better, we’ve identified at least two key approaches.

Model fine-tuning

In the realm of deep learning, fine-tuning is a transfer-learning strategy in which the weights of a pre-trained model are re-trained on new data. For the most effective model here, real-ESRGAN, this means running an additional training procedure on a custom dataset. This method is thoroughly detailed in this article. As an example, using high-resolution (HR) images shot at the same location as the video can be beneficial. The broader and more specific the dataset, the more refined the results.
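As a rough starting point, a custom dataset can be assembled as pairs of high-resolution images and their downscaled counterparts. The sketch below uses a simple 4x bicubic downscale and hypothetical folder names; note that Real-ESRGAN’s own training pipeline generates much richer synthetic degradations (blur, noise, compression) on the fly, so this is only a simplified illustration.

from pathlib import Path
from PIL import Image

hr_dir = Path("dataset/hr")           # hypothetical layout: HR images from the shoot location
lr_dir = Path("dataset/lr")
lr_dir.mkdir(parents=True, exist_ok=True)

for hr_path in hr_dir.glob("*.png"):
    hr = Image.open(hr_path)
    lr = hr.resize((hr.width // 4, hr.height // 4), Image.BICUBIC)
    lr.save(lr_dir / hr_path.name)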

Postprocessing

After the AI image generation step, various enhancers can be applied for photo correction, color optimization, and sharpening. A second pass can then be run with milder parameters, for example upscaling by 2x instead of the initial 4x.
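As an example of such post-processing, a light pass with Pillow’s built-in filters could look like the sketch below; the enhancement factors are illustrative and should be tuned per video.

from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("enhanced/frame_00000.png")

img = img.filter(ImageFilter.UnsharpMask(radius=2, percent=80, threshold=3))  # mild sharpening
img = ImageEnhance.Color(img).enhance(1.1)       # slightly richer colors
img = ImageEnhance.Contrast(img).enhance(1.05)   # gentle contrast boost

img.save("post/frame_00000.png")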

Conclusion

Throughout our exploration of enhancing specific video samples, we rigorously evaluated some of the most renowned and efficient AI super-resolution solutions currently accessible. From our findings, real-ESRGAN emerged as the most fitting solution for video sequence improvement. With judicious input data selection, this network can yield commendable outputs.

However, while Stable Diffusion demonstrated satisfactory results for isolated frames, the comprehensive video sequence manifested noise and model-specific artifacts. It’s pivotal to highlight that the quality of the outcomes, especially for specific samples, aligns with some of the top-tier solutions in the industry. Such a methodology could potentially underpin an automated UI solution tailored for end-users. Moreover, with model fine-tuning, on-demand video restoration for specific samples becomes a plausible endeavor.

Contact Us

Contemplating a project? Get in touch with us to turn your ideas into digital reality.
