Online presence at a much lower bitrate

BitGen2D is a high-fidelity video codec dedicated to video conferencing and talking-head video. It delivers a photorealistic experience at a bitrate 3 to 5 times lower than standard video codecs. Such a substantial bitrate reduction enables:

  • 5x lower video latency
  • 5x lower transceiver power
  • uninterrupted remote presence under poor wireless signal conditions
  • significantly better Quality of Experience and increased user engagement time

[Video comparison: NVENC HEVC at 320 kbps, NVENC HEVC at 64 kbps, BitGen2D at 64 kbps]

Neural Warping with any video codec

BitGen2D relies on two main components:

  • A neural warp engine that can warp a reference frame to the same pose as a target frame.
  • A Region of Interest (RoI) pipeline that extracts, transmits and blends the RoI into the warped frame.

The hybrid design of BitGen2D is meant to alleviate the potential uncanny-valley effects inherent to generative AI for video representations. While the warp engine can render photorealistic frames, it sometimes struggles to capture fine details of facial expressions. When it comes to communication, however, humans are particularly attentive to such details, which are mainly conveyed by the eyes and mouth. In BitGen2D, we propose to send those regions of interest directly through any standard codec (NVENC AVC, HEVC, AV1...) and blend them into the warped frame. This guarantees an accurate rendition of facial expressions while limiting the bitrate requirements.
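As a rough illustration of the decoder-side blending step, the sketch below pastes a decoded RoI patch into a warped frame using a feathered alpha mask. This is not BitGen2D's actual blending method, which is not described here: the function name `blend_roi`, the `feather` parameter and the linear feathering are all assumptions made for illustration, and the warped frame is simply taken as an input array.

```python
import numpy as np


def blend_roi(warped, roi, top, left, feather=4):
    """Blend an RoI patch into a warped frame with a feathered border.

    warped: (H, W, 3) float array, output of a (hypothetical) warp engine.
    roi:    (h, w, 3) float array, RoI decoded from a standard codec.
    top, left: position of the RoI's upper-left corner in the frame.
    feather: width in pixels of the linear ramp at the RoI border (assumed).
    """
    h, w = roi.shape[:2]
    # Distance of each row/column to the nearest patch edge, counted from 1.
    rows = np.minimum(np.arange(h), np.arange(h)[::-1]) + 1
    cols = np.minimum(np.arange(w), np.arange(w)[::-1]) + 1
    dist = np.minimum(rows[:, None], cols[None, :])
    # Alpha ramps linearly from the border up to 1 in the patch interior.
    alpha = np.clip(dist / (feather + 1), 0.0, 1.0)[..., None]

    out = warped.copy()  # leave the input frame untouched
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * roi + (1.0 - alpha) * region
    return out
```

With `feather=0` the patch is pasted as-is; larger values trade sharpness at the RoI border for a less visible seam between the codec-decoded pixels and the warped pixels.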
The concept of BitGen2D encoding-decoding is illustrated below:

A robust and user agnostic solution

BitGen2D is highly reliable: it can accurately render the various events that can happen during a video call, such as hand movements and foreign objects entering the frame. It is also user agnostic, as it does not require any fine-tuning on a specific identity. This makes BitGen2D plug-and-play for any user and lets it withstand situations like a change in appearance (clothes, accessories, haircut…) or a switch of speaker during a live call.

Showcasing BitGen2D encoding-decoding on a laptop (extract from the GTC 2023 presentation).

Generative speaker content anchored in the speaker's real appearance

If the user provides BitGen2D with a calibrated front-facing selfie photo, it can be used as the reference frame for the warp engine, allowing the user to present a different appearance during a video conference call. An example is shown in the video below. Generative AI content can thus change non-essential features of the speaker, like clothes and hairstyle, while still using RoI data from the speaker's real appearance, which avoids person reenactment and deep fakes.

Actual Video Captured
BitGen2D Generated Video

Current benchmarks

Quality and Bitrate

In the video below, we showcase a visual comparison between BitGen2D and NVENC HEVC at equivalent quality and at equivalent bitrate.

[Video comparison: NVENC HEVC at 280 kbps (equivalent quality), NVENC HEVC at 84 kbps (equivalent bitrate), BitGen2D at 81 kbps]
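The bitrate figures quoted in the comparisons on this page can be turned into reduction factors with a few lines of arithmetic. The sketch below is only a worked example: the helper name `bitrate_reduction` is hypothetical, and each pair takes the higher-bitrate NVENC HEVC stream as the reference.

```python
def bitrate_reduction(reference_kbps, codec_kbps):
    """Return (reduction factor, percentage saved) relative to the reference."""
    factor = reference_kbps / codec_kbps
    saved_pct = 100.0 * (1.0 - codec_kbps / reference_kbps)
    return factor, saved_pct


# Bitrate pairs quoted on this page: (NVENC HEVC reference, BitGen2D), in kbps.
for ref, bitgen in [(320, 64), (280, 81)]:
    factor, saved = bitrate_reduction(ref, bitgen)
    print(f"{ref} kbps -> {bitgen} kbps: {factor:.2f}x lower, {saved:.1f}% saved")
```

Both pairs fall within the "3 to 5 times lower" range stated in the introduction: the first gives a factor of 5, the second a factor of roughly 3.5.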

The quality of BitGen2D was assessed by an evaluation panel. Following the lab-based P.910 protocol, we built a test set of talking-head clips, each encoded and decoded with one of three configurations: BitGen2D, NVENC HEVC at equivalent quality, or NVENC HEVC at equivalent bitrate. Those clips were then displayed in a random playlist and, for each clip, the test subject gave a score between 1 and 5, with 1 being "unwatchable" and 5 being the top quality.
Each clip lasts about 10 s at a resolution of 768x768.
Testing was done under the following conditions: 1080p monitor between 15 and 24", viewer at a distance of 3 screen heights, well-lit room with no reflection on the screen.

Test video resolution: 768p@25fps

Codec                       Average bitrate          P.910 MOS
                            (kb/s, ↓ is better)      (↑ is better, max. is 5.0)
NVENC HEVC @ same quality   280                      4.0 ± 0.3
NVENC HEVC @ same bitrate   84                       2.0 ± 0.3
BitGen2D + NVENC HEVC       81                       3.8 ± 0.3

ITU-T P.910 Absolute Category Rating with Hidden Reference (ACR-HR), 5-point scale, 18 raters; the 95% confidence interval is reported (P.910 official recommendations).
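For readers unfamiliar with how a "MOS ± interval" figure is typically obtained from a panel, here is a minimal sketch. It assumes a Student-t interval over the 18 raters, which is one common choice (a normal approximation with 1.96 is another). The ratings in the example are hypothetical, not the study's data, and the helper name `mos_with_ci` is ours.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 17 degrees of freedom (18 raters).
T_CRIT_DF17 = 2.110


def mos_with_ci(scores, t_crit=T_CRIT_DF17):
    """Return (MOS, half-width of the 95% confidence interval)."""
    n = len(scores)
    mos = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = t_crit * sd / math.sqrt(n)
    return mos, half_width


# Hypothetical ratings from 18 raters for one condition (NOT the study's data).
ratings = [4, 4, 3, 4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 4, 5, 3, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.1f} ± {ci:.1f}")
```

Note that `T_CRIT_DF17` is only valid for 18 raters; a different panel size needs the critical value for its own degrees of freedom.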

A sample of the P.910 clips will be made available for download soon.


Hardware

Nvidia

Below we report the resource usage of BitGen2D on an Nvidia GPU, benchmarked on a Razer Blade 17 (2021) Windows 11 laptop with:

  • Nvidia RTX 3080 mobile 16 GB, CUDA 11.3
  • Intel i9-11900H @2.50GHz
  • Memory usage: GPU VRAM ~1.8GB, system RAM ~3.3GB (not optimized)


                               GPU             CPU              DeepSpeed Profiler
                               (nvidia-smi)    (task manager)
Component        GPU   FPS     usage (%)       usage (%)        GMAC/frame   # of params
End-to-end       Yes   25      30              12               52.9         47 M
Encoder          Yes   119     98              6                15.6         23 M
Decoder          Yes   66      83              12               18.8         23 M
Upscaler         Yes   806     97              7                18.8         668 K
Perceptual Sc.   Yes   12547   29              12               0.7          656 K
RoI Extraction   No    93      0               11               0.1          2.5
RoI Blending     No    299     0               11               0.1          30 K
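To relate the FPS and GMAC/frame columns to a real-time budget, the sketch below converts a few rows of the benchmark table into per-frame processing time and into compute per second at the 25 fps test rate. It assumes the per-component FPS figures are standalone throughput measurements, which the table does not state explicitly; the helper names are hypothetical.

```python
# (component, FPS, GMAC/frame) copied from the benchmark table.
BENCHMARKS = [
    ("End-to-end", 25, 52.9),
    ("Encoder", 119, 15.6),
    ("Decoder", 66, 18.8),
]

CALL_FPS = 25  # frame rate of the test videos (768p@25fps)


def frame_time_ms(fps):
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / fps


def gmac_per_second(gmac_per_frame, fps):
    """Compute load in GMAC/s when running at the given frame rate."""
    return gmac_per_frame * fps


for name, fps, gmac in BENCHMARKS:
    print(
        f"{name}: {frame_time_ms(fps):.1f} ms/frame, "
        f"{gmac_per_second(gmac, CALL_FPS):.1f} GMAC/s at {CALL_FPS} fps"
    )
```

At 25 fps each frame has a 40 ms budget, so any component whose standalone rate exceeds 25 FPS fits within it when considered on its own.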

Web application

A live web demo and a standalone application for NVIDIA RTX-enabled devices are available upon request and evaluation framework approval.

About Us

iSIZE is an AI-based video streaming systems company based in London, UK. For more information, see www.isize.co.