Using Machine Learning to Improve Video Resolution a.k.a “Super Resolution”

Stefan Battiston
Dec 7

In this Well Red post, we explore the feasibility of using machine learning to upscale low resolution video to high resolution, using a technique called Real-time Super Resolution.

At REDspace, we’ve recently launched the Evolving Skills Program (ESP), a new program for investigating cutting-edge technologies with the goal of bringing brand new capabilities to our clients. Our pilot project under this program was an investigation of the “production-readiness” of Real-time Super Resolution. This was chosen for its potential benefit in delivering high quality video, something that many of our clients do.

What is Super Resolution?

Super Resolution (SR) refers to using a trained neural network that accepts low resolution images or video as input and outputs high resolution images or video, with the goal of a greater visual improvement than traditional upscaling algorithms such as bilinear or bicubic interpolation. SR can be applied “offline”, producing a new, higher resolution video file that can be served to clients. The resulting file is larger and uses more bandwidth to deliver, but this approach is useful when the source video does not exist in a high resolution form, as with old or archival video. In this kind of application, SR beats standard upscaling algorithms in terms of video quality.

SR can also be done in “real-time” where the SR is applied on the client-side as a video is played, saving the difference in bandwidth (minus the much smaller size of the trained model).
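
For reference, the bilinear baseline that SR is compared against can be sketched in a few lines. This is a minimal illustration in pure Python, assuming a grayscale image stored as a list of rows and an integer scale factor; the function name and the pixel-centre mapping are choices made for this sketch, not details taken from ffmpeg or from the project.

```python
def bilinear_upscale(image, factor):
    """Upscale a 2D grayscale image (list of rows) by an integer factor."""
    src_h, src_w = len(image), len(image[0])
    out = []
    for y in range(src_h * factor):
        # Map the output pixel centre back into source coordinates, clamped to the image.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), src_h - 1)
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for x in range(src_w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), src_w - 1)
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            # Blend the four nearest source pixels by distance.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# A hard edge between 0 and 100 becomes a smooth ramp.
upscaled = bilinear_upscale([[0, 100], [0, 100]], 2)
print(upscaled[0])
```

Every output pixel is a weighted average of its neighbours, so interpolation can only smooth what is already there. A trained network, by contrast, can learn to restore detail that the averaging loses.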

Real-time Super Resolution vs. traditional video compression

Real-time SR is somewhat analogous to lossy video compression, where video is sent across the network at a smaller size, with the client understanding how to decompress the video to display it at full resolution.

An important distinction is that real-time SR doesn’t modify the video, meaning that it can be used together with traditional compression.

Implementation

Super Resolution is a broad goal, and there are many possible ways to implement a solution. This implementation is based on this paper: [1609.05158] Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Figure 1 from paper linked above, describing the layers of the CNN
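
The network in the paper ends with a sub-pixel convolution layer, which rearranges r*r low resolution feature maps into a single image that is r times larger in each dimension. The rearrangement step on its own can be sketched as follows. This is a pure Python illustration, not the project's WebGL code; the function name and the channel ordering are assumptions (the ordering follows one common deep learning framework convention and may differ from the repository's).

```python
def pixel_shuffle(channels, r):
    """Rearrange C*r*r feature maps of size H x W into C maps of size (H*r) x (W*r)."""
    if len(channels) % (r * r) != 0:
        raise ValueError("number of channels must be a multiple of r*r")
    out_channels = len(channels) // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_channels)]
    for c in range(out_channels):
        for y in range(h * r):
            for x in range(w * r):
                # Each output pixel comes from one of the r*r input maps,
                # chosen by its position inside the r x r output block.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = channels[src][y // r][x // r]
    return out


# Four 1x1 feature maps become one 2x2 image.
print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2))
```

Because the convolutions run at low resolution and only this cheap rearrangement happens at high resolution, the approach is fast enough to consider for real-time use.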

Offline, using ffmpeg, a single video is sliced into PNG frames, downscaled bicubically (e.g. 1080p -> 360p), then upscaled bilinearly back to the original resolution (360p -> 1080p). Using these 3 sets of images (original, downscaled, and re-upscaled), a convolutional neural network (CNN) is trained to produce a difference map between the low res and high res frames, and the trained model is output as a set of numeric weights for each CNN layer. Then, in the client’s browser, a WebGL texture is created from each layer’s weights so that it applies that layer’s convolution, and together, the textures transform the video in real-time to improve the visual quality.
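
The frame preparation step described above could look roughly like the following. This is a hedged sketch rather than the project's actual scripts: the function name, output file patterns, and the 1920x1080 / 640x360 frame sizes (16:9 video assumed) are all illustrative. It only uses ffmpeg's `-i` and `-vf` options and the `scale` filter with its `flags` option.

```python
import shlex


def build_ffmpeg_commands(src, out_dir, hi=(1920, 1080), lo=(640, 360)):
    """Build (but do not run) ffmpeg commands for the three training image sets."""
    down = f"scale={lo[0]}:{lo[1]}:flags=bicubic"   # 1080p -> 360p, bicubic
    up = f"scale={hi[0]}:{hi[1]}:flags=bilinear"    # 360p -> 1080p, bilinear
    return {
        # Original frames, sliced to PNG.
        "high_res": ["ffmpeg", "-i", src, f"{out_dir}/hr_%05d.png"],
        # Bicubically downscaled frames.
        "low_res": ["ffmpeg", "-i", src, "-vf", down, f"{out_dir}/lr_%05d.png"],
        # Downscaled, then bilinearly upscaled back to the original size.
        "upscaled": ["ffmpeg", "-i", src, "-vf", f"{down},{up}", f"{out_dir}/up_%05d.png"],
    }


commands = build_ffmpeg_commands("input.mp4", "frames")
for name, cmd in commands.items():
    print(name, "->", shlex.join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed; the sketch only prints them so that it has no external dependencies.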

This “real-time” implementation of the linked paper was originally created by Nick Chadwick at Mux, and we forked it so we could experiment with it and try to improve performance.

The results

You can view a demo of the real-time SR on some videos on this demo site. The “Third Party Reference Demos” section contains videos provided by Nick Chadwick, along with his trained models. In the “REDspace Demos” section you can view the 1080p video used as input, and the SR demo applied to the 360p version.

The following screenshots are not realistically representative, as the screenshotting and downscaling process alters the quality differently than when actually viewed.

The 12-second 1080p video is 35MB, and the downscaled 360p video is 4.3MB. With the individually-trained video weights costing 135KB, and a one-time cost of 34KB for the WebGL code that applies the model, the SR delivery comes to about 4.5MB, or roughly 12.7% of the original video size. That is a bandwidth saving of just over 30MB.
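
The arithmetic behind these figures can be checked with a short calculation. This is a sketch using the rounded sizes quoted above, and it assumes 1MB = 1000KB; the function name is illustrative.

```python
KB_PER_MB = 1000  # assumption: decimal units


def sr_bandwidth(original_mb, downscaled_mb, weights_kb, shader_kb):
    """Return (delivered MB, saved MB, delivered fraction of the original)."""
    delivered = downscaled_mb + (weights_kb + shader_kb) / KB_PER_MB
    saved = original_mb - delivered
    return delivered, saved, delivered / original_mb


delivered_mb, saved_mb, fraction = sr_bandwidth(35, 4.3, 135, 34)
print(f"delivered {delivered_mb:.2f} MB, saved {saved_mb:.2f} MB, "
      f"{fraction:.1%} of the original")
```

With the rounded inputs this gives about 4.47MB delivered and 30.53MB saved, or roughly 12.8% of the original, which is in line with the figure quoted in the post given that the published file sizes are themselves rounded.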

Although it is lower quality than the 1080p, the SR video itself offers impressive video quality improvements over the 360p, with more clearly delineated grass in the background, and reflective water caustics re-added.

Limitations

Real-time application of SR relies heavily on the GPU to apply the trained model as a WebGL shader. For example, on a MacBook Pro 2017, the model uses nearly 100% of the AMD Radeon Pro 555 GPU, when running in Chrome (slightly less in Firefox). There is a lot of room for improvement in this aspect, but SR will always be more computationally expensive than bicubic interpolation.

As well, compared to a traditional upscaling algorithm like bicubic, SR requires offline ML training for each video, which is also GPU/time intensive, and incurs cost.

Next steps

Client-side

Can we utilize TensorFlow.js to handle the client/browser side? Right now the WebGL code used to apply the trained model/weights is written by hand, can only handle very specific situations, and is not very optimized. TensorFlow.js currently has a WebGL backend, but it only uses it as a kind of “processor black-box,” where it passes input data in for calculations and receives the output. For real-time SR, since we’re applying the model using WebGL, we would want to avoid passing every frame into the TF.js WebGL backend, reading back the modified frame, and then passing that back into WebGL for rendering. If we can modify TensorFlow.js to suit our purposes (as described in this GitHub issue from my colleague), it should be possible to handle more scenarios, and iterate/experiment faster on new AI models.


Training pipeline

How can we improve the training pipeline? With our current setup, there are a number of manual steps: ssh’ing into a GPU-heavy EC2 instance, running the scripts, and copying the results back. It may be possible to use AWS SageMaker or another ML platform to make it very simple to ingest videos and output a trained model/weights for them.


CNN model

How can we improve the model itself? Currently it’s trained by cutting the input video up into PNGs, downscaling bicubically, and upscaling bilinearly. Can we use the entire video without cutting it up, which would increase the amount of input data? Can we modify the model or use a different type of model (like a GAN or echo network) that could train not just on single PNGs/frames, but on the changes between frames? Is the current ffmpeg downscaling/upscaling optimal for our goals?

Conclusion

Real-time Super Resolution is a very new technology, but with some effort it could be a big step forward in video quality and bandwidth savings for video streaming providers. If you’re interested in experimenting with the process described in this post, you can clone the repository here (forked from Nick Chadwick’s work). You can also contribute changes to TensorFlow.js as described in the GitHub issue linked above. Thanks for reading!
