
 

Talking Head Anime 4:

Distillation for Real-Time Performance

WACV 2025

 

Pramook Khungurn

 

Paper arXiv Code Demo #1 Demo #2

 


Using knowledge distillation [Hinton et al. 2015], we can compress the movement of a specific character into a small (< 2 MB) neural network model that can generate 512$\times$512 animation frames at real-time frame rates using consumer gaming GPUs. The model is so lightweight that it can run entirely in web browsers and still produce smooth animation.

 

Abstract

We study the problem of creating a character model that can be controlled in real time from a single image of an anime character. A solution would greatly reduce the cost of creating avatars, computer games, and other interactive applications.

Talking Head Anime 3 (THA3) is an open source project that attempts to directly address the problem. It takes as input (1) an image of an anime character's upper body and (2) a $45$-dimensional pose vector and outputs a new image of the same character taking the specified pose. The range of possible movements is expressive enough for personal avatars and certain types of game characters.

THA3's main limitation is its speed. It can achieve interactive frame rates ($\approx$ 20 FPS) only if it is run on a very powerful GPU (Nvidia Titan RTX or better). Based on the insight that avatars and game characters do not need to change their appearance frequently, we propose a technique to distill the system into a small student neural network (< 2 MB) specific to a particular character. The student model can generate $512\times512$ animation frames in real time ($\geq$ 30 FPS) using consumer gaming GPUs while preserving the image quality of the teacher model. For the first time, our technique makes the whole system practical for real-time applications.

Method

The THA systems, including the 4th version that we propose in this paper, are overly capable: at any time, we can swap in a new character image and animate it immediately. However, our target use cases, virtual YouTubers (VTubers) and game characters, do not change their appearance every second or every minute, so they do not need this flexibility. By creating neural networks that are specialized to a specific character image, we can obtain faster models that work under real-time constraints.

The faster models in question are created by knowledge distillation, the practice of training a machine learning model (the student) to mimic the behavior of another model (the teacher). In our case, the teacher is the improved THA system, and the student consists of two neural networks. The face morpher modifies the character's facial expression, and the body rotator rotates the face and the torso.
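The snippet below is a minimal PyTorch sketch of this idea, not the actual training procedure from the paper: `teacher`, `student`, and `character_image` are assumed placeholders, the pose sampling is uniform, and the loss is a plain L1 image loss. Because the character image never changes, it is baked into the student's weights, and the student receives only the pose vector at run time.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, character_image, optimizer,
                      batch_size=8, pose_dim=45):
    """One illustrative step of character-specific distillation.

    teacher:         the frozen full THA system (image + pose -> posed frame)
    student:         the small character-specific model (pose -> posed frame)
    character_image: the single character image the student is specialized to
    """
    # Sample a batch of random pose vectors; the real sampling scheme and
    # training losses are described in the paper.
    pose = torch.rand(batch_size, pose_dim) * 2.0 - 1.0
    with torch.no_grad():
        target = teacher(character_image, pose)   # teacher's posed frames
    output = student(pose)                        # the student sees only the pose
    loss = F.l1_loss(output, target)              # e.g., an L1 image loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```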


The student model.

Architecture-wise, the face morpher is a sinusoidal representation network (SIREN) [Sitzmann et al. 2020]. The body rotator is also a SIREN, but with two modifications. First, it generates images in a multi-resolution fashion, starting at 128$\times$128, then 256$\times$256, and finally 512$\times$512. This modification makes it fast enough to achieve real-time frame rates. Second, it uses image processing operations such as warping and alpha blending. This modification helps it preserve the details of the input character image.
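For reference, a SIREN is a fully connected network whose layers use sine activations with carefully scaled initialization. The layer below is a minimal PyTorch sketch following Sitzmann et al. 2020; the widths, depth, and $\omega_0$ used by our student networks are hyperparameters reported in the paper, not shown here.

```python
import math
import torch
from torch import nn

class SirenLayer(nn.Module):
    """One SIREN layer: a linear map followed by a sine activation."""

    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        # Initialization from Sitzmann et al. 2020: the first layer uses a
        # wider range; later layers are scaled down by omega_0.
        with torch.no_grad():
            if is_first:
                bound = 1.0 / in_features
            else:
                bound = math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))
```

A full SIREN stacks several such layers and ends with a plain linear layer that produces the output values.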


The student face morpher.

The student body rotator.
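The sketch below illustrates only the two image processing operations mentioned above, using standard PyTorch operations: warping the input character image with a predicted flow field and alpha-blending the warped result with a directly synthesized image. The exact outputs and compositing scheme of the student body rotator are specified in the paper.

```python
import torch
import torch.nn.functional as F

def warp_and_blend(source, flow, synthesized, alpha):
    """Warp `source` by a flow field, then alpha-blend with `synthesized`.

    source:      (N, C, H, W) input character image
    flow:        (N, H, W, 2) offsets added to the sampling grid, in [-1, 1] coords
    synthesized: (N, C, H, W) image generated from scratch by the network
    alpha:       (N, 1, H, W) blending weights in [0, 1]
    """
    n, _, h, w = source.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys = torch.linspace(-1.0, 1.0, h, device=source.device)
    xs = torch.linspace(-1.0, 1.0, w, device=source.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    base_grid = torch.stack([grid_x, grid_y], dim=-1).expand(n, h, w, 2)

    # Warp the source image, then blend it with the synthesized image.
    warped = F.grid_sample(source, base_grid + flow, align_corners=True)
    return alpha * warped + (1.0 - alpha) * synthesized
```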

Details such as how to train the student model and how it performs against the teacher are available in the paper.

Demos

We provide two demo web applications.

  1. The manual poser demo allows you to pose characters by manipulating UI widgets, mainly sliders.
  2. The webcam demo allows you to control characters with your own facial and bodily movements. This demo needs a web camera to function.

The demos are best run on a computer with a dedicated gaming GPU. We were able to get real-time frame rates with an Nvidia GeForce GTX 1080 Ti.

Supplementary Material

The supplementary material for the paper is available on this web page.

Credits

To create the demos, we use illustrations by third-party creators.

We use three illustrations from the 東北ずん子・ずんだもんプロジェクト (Tohoku Zunko and Zundamon Project) by SSS LLC. They are:

The original illustrations can be found with the E-mote models distributed by M2, Inc. The terms of use for these illustrations can be found here.

We use seven illustrations by Mikatsuki Arpeggio (三日月アルペジオ). They are:

All the illustrations have been resized and cropped to the specifications of our models. The terms of use for these illustrations can be found here.

We thank the creators for generously providing the illustrations for us to build upon.

While the neural networks are developed with PyTorch, the web demos use TensorFlow.js with custom units that we developed ourselves. We use MediaPipe FaceLandmarker to perform blendshape parameter estimation from the webcam feed.
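As a concrete illustration, the snippet below sketches blendshape estimation with MediaPipe's FaceLandmarker through its Python Tasks API; the web demo uses the equivalent JavaScript API, the model bundle path is a placeholder, and the mapping from blendshape scores to the 45-dimensional pose vector is omitted.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# Assumes the face_landmarker.task model bundle has been downloaded locally.
options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

# A single webcam frame loaded as a MediaPipe image.
frame = mp.Image.create_from_file("webcam_frame.png")
result = landmarker.detect(frame)

if result.face_blendshapes:
    # Each blendshape has a name (e.g., "eyeBlinkLeft") and a score in [0, 1];
    # these scores would then be mapped to the pose vector (omitted here).
    scores = {b.category_name: b.score for b in result.face_blendshapes[0]}
```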

 

Update History

Project Fuji