\( \def\sc#1{\dosc#1\csod} \def\dosc#1#2\csod{{\rm #1{\small #2}}} \newcommand{\dee}{\mathrm{d}} \newcommand{\Dee}{\mathrm{D}} \newcommand{\In}{\mathrm{in}} \newcommand{\Out}{\mathrm{out}} \newcommand{\pdf}{\mathrm{pdf}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\ve}[1]{\mathbf{#1}} \newcommand{\mrm}[1]{\mathrm{#1}} \newcommand{\ves}[1]{\boldsymbol{#1}} \newcommand{\etal}{{et~al.}} \newcommand{\sphere}{\mathbb{S}^2} \newcommand{\modeint}{\mathcal{M}} \newcommand{\azimint}{\mathcal{N}} \newcommand{\ra}{\rightarrow} \newcommand{\mcal}[1]{\mathcal{#1}} \newcommand{\X}{\mathcal{X}} \newcommand{\Y}{\mathcal{Y}} \newcommand{\Z}{\mathcal{Z}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\z}{\mathbf{z}} \newcommand{\tr}{\mathrm{tr}} \newcommand{\sgn}{\mathrm{sgn}} \newcommand{\diag}{\mathrm{diag}} \newcommand{\Real}{\mathbb{R}} \newcommand{\sseq}{\subseteq} \newcommand{\ov}[1]{\overline{#1}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \)

 

Talking Head(?) Anime
from a Single Image 4

Improved Model and Its Distillation

 

Pramook Khungurn

 

Paper | arXiv | Code | Demo #1 | Demo #2

 


Using knowledge distillation [Hinton et al. 2015], we can compress the movement of a specific character into a small (< 2 MB) neural network model that can generate 512$\times$512 animation frames at real-time frame rates using consumer gaming GPUs. The model is so lightweight that it can run entirely in web browsers and still produce smooth animation.

 

Abstract

We study the problem of creating a character model that can be controlled in real time from a single image of an anime character. A solution to this problem would greatly reduce the cost of creating avatars, computer games, and other interactive applications.

Talking Head Anime 3 (THA3) is an open source project that directly addresses the problem. It takes as input (1) an image of an anime character's upper body and (2) a $45$-dimensional pose vector and outputs a new image of the same character taking the specified pose. The range of possible movements is expressive enough for personal avatars and certain types of game characters. However, the system is too slow to generate animation in real time on common PCs, and its image quality can be improved.

In this paper, we improve THA3 in two ways. First, we propose new architectures for the subnetworks that rotate the character's head and body, based on the U-Nets with attention [Ho et al. 2020] that are widely used in modern generative models. The new architectures consistently yield better image quality than the baseline. However, they also make the whole system much slower: it takes up to 150 milliseconds to generate a frame. Second, we propose a technique to distill the system into a small network (< 2 MB) that can generate 512$\times$512 animation frames in real time ($\geq$ 30 FPS) using consumer gaming GPUs while keeping the image quality close to that of the full system. This improvement makes the whole system practical for real-time applications.

Method

The THA systems, including the 4th version that we propose in this paper, are overly capable. At any time, we can swap in a new character image and animate it immediately. However, our target use cases, virtual YouTubers (VTubers) and game characters, do not change their appearance frequently enough to warrant this flexibility. By creating neural networks that are specialized to a specific character image, we may obtain faster models that work under real-time constraints.

The faster models in question are created by knowledge distillation, the practice of training a machine learning model (the student) to mimic the behavior of another model (the teacher). In our case, the teacher model is the improved THA system. The student is a collection of two neural networks: the face morpher modifies the character's facial expression, and the body morpher rotates the face and the torso.


The student model.
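
To make the idea concrete, the sketch below shows what such a distillation loop can look like in PyTorch. It is an illustrative sketch, not the actual training procedure (which is described in the paper): the `teacher`, `student`, pose sampling, loss, and hyperparameters are all assumptions made for the example.

```python
import torch

# Illustrative sketch only; names and hyperparameters are assumptions.
# `teacher(image, poses)` stands for the full THA system (slow, general),
# `student(poses)` for the small character-specific model being trained.

def distill(teacher, student, character_image, pose_dim=45,
            num_steps=100_000, batch_size=8, device="cuda"):
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
    character_image = character_image.to(device)

    for step in range(num_steps):
        # Sample random pose vectors covering the supported movements.
        poses = torch.rand(batch_size, pose_dim, device=device)

        # The teacher provides the "ground truth" frames for this character.
        with torch.no_grad():
            target = teacher(character_image, poses)

        # The student sees only the pose; the character is baked into its weights.
        prediction = student(poses)

        loss = torch.nn.functional.l1_loss(prediction, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the character image is fixed during this process, the student only needs the pose vector at inference time, which is what allows it to be so small and fast.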

Architecture-wise, the face morpher is a sinusoidal representation network (SIREN) [Sitzmann et al. 2020]. The body morpher is also a SIREN but with two modifications. First, it generates images in a multi-resolution fashion, starting at 128$\times$128, then 256$\times$256, and finally 512$\times$512. This modification makes it fast enough to achieve real-time frame rates. Second, it uses image processing operations such as warping and alpha blending, which help it better preserve the details of the input character image.


The student face morpher.

The student body morpher.
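
The sketch below illustrates the two building blocks mentioned above: a SIREN layer (a linear layer followed by a sine activation, following Sitzmann et al. [2020]) and a warping-plus-alpha-blending composition step. It is a simplified illustration under assumed names, sizes, and frequencies, not the networks we ship; in the multi-resolution scheme, outputs like these would be produced at 128$\times$128, upsampled, and refined at 256$\times$256 and 512$\times$512.

```python
import torch
import torch.nn.functional as F

class SirenLayer(torch.nn.Module):
    """Linear layer followed by a sine activation, as in Sitzmann et al. [2020]."""
    def __init__(self, in_features, out_features, w0=30.0):
        super().__init__()
        self.linear = torch.nn.Linear(in_features, out_features)
        self.w0 = w0

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))

class TinySiren(torch.nn.Module):
    """Maps (pixel coordinate, pose) to output channels; sizes are illustrative."""
    def __init__(self, pose_dim=45, hidden=128, out_channels=4):
        super().__init__()
        self.layers = torch.nn.Sequential(
            SirenLayer(2 + pose_dim, hidden),
            SirenLayer(hidden, hidden),
            torch.nn.Linear(hidden, out_channels),
        )

    def forward(self, coords, pose):
        # coords: (N, 2) in [-1, 1]; pose: (pose_dim,) broadcast to every pixel.
        pose = pose.expand(coords.shape[0], -1)
        return self.layers(torch.cat([coords, pose], dim=-1))

def compose_frame(source_image, flow, generated_rgba):
    """Warping + alpha blending, the second modification described above.

    source_image:   (1, 4, H, W) RGBA input character image
    flow:           (1, H, W, 2) sampling grid predicted by the network
    generated_rgba: (1, 4, H, W) pixels synthesized directly by the network
    """
    warped = F.grid_sample(source_image, flow, align_corners=False)
    alpha = generated_rgba[:, 3:4]          # per-pixel blending weight
    return alpha * generated_rgba + (1.0 - alpha) * warped
```

Warping copies pixels from the input illustration, while the directly generated pixels fill in regions (such as disoccluded areas) that warping cannot reproduce; the alpha mask decides, per pixel, which source to trust.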

Details such as how to train the student model and how it performs against the teacher are available in the paper.

Demos

We provide two demo web applications.

  1. The manual poser demo allows you to pose characters by manipulating UI widgets, mainly sliders.
  2. The webcam demo allows you to control characters with your own facial and bodily movement. This demo needs a web camera to function.

The demos are best run on a computer with a dedicated gaming GPU. We were able to get real-time frame rates with an Nvidia GeForce GTX 1080 Ti.

Credits

To create the demos, we use illustrations by third-party creators.

We use three illustrations from 東北ずん子・ずんだもんプロジェクト (the Tohoku Zunko and Zundamon Project) by SSS LLC.

The original illustrations can be found with the E-mote models distributed by M2, Inc. The terms of use for these illustrations can be found here.

We use seven illustrations by Mikatsuki Arpeggio (三日月アルペジオ).

All the illustrations have been resized and cropped to match the specifications of our models. The terms of use for these illustrations can be found here.

We thank the creators for generously providing the illustrations for us to build upon.

While the neural networks are developed with PyTorch, the web demos use TensorFlow.js with custom units that we developed ourselves. We use MediaPipe FaceLandmarker to estimate blendshape parameters from the webcam feed.
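
As a rough illustration of the pose-estimation step, the snippet below uses the MediaPipe Tasks API for Python to read blendshape scores from a single frame; the web demo itself does the equivalent with the JavaScript API alongside TensorFlow.js. The model file path and option values are assumptions for the example.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# Assumes the face landmarker model bundle has been downloaded from MediaPipe.
options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,   # we only need the blendshape scores
    num_faces=1,
)

landmarker = vision.FaceLandmarker.create_from_options(options)
frame = mp.Image.create_from_file("webcam_frame.png")  # stand-in for a webcam frame
result = landmarker.detect(frame)

if result.face_blendshapes:
    # Each entry has a name (e.g. "jawOpen") and a score in [0, 1],
    # which the demo maps onto the pose vector driving the character.
    for category in result.face_blendshapes[0]:
        print(category.category_name, category.score)

landmarker.close()
```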

 

Update History

Project Fuji