

Talking Head Anime
from a Single Image 2:
More Expressive

Pramook Khungurn


The characters are corporate/independent virtual YouTubers and their related characters. Images and videos in this article are their fan arts. [footnote]


Abstract. I extended the animation-from-a-single-image neural network system I created in 2019 so that characters can make more types of facial expressions. While the old system can only open and close the eyes and the mouth, this new version affords more eye and mouth shapes and can control the eyebrows and the irises. These movements allow a character to show various emotions and give a more convincing impression of speech.

Input [copyright], Happy, Sad, Angry, Disgusted, Condescending, Uwamedukai [footnote], Gangimari-Gao [footnote]

With the new network, I can drive character illustrations with motions authored for 3D models.

I also created a real-time motion transfer tool that provides more controls over the character's face.

I modified the tool to record my motion and was later able to make multiple characters talk and sing with more dynamic lip and face movements.


1   Motivation

With the goal of making it easier to become a virtual YouTuber (VTuber), in 2019 I created a neural network system that can animate the face of any existing anime character, given only an image of it. The system, however, cannot yet be considered practical for becoming a VTuber. The most important shortcoming is that it can only close the eyes and mouth, robbing the character of the ability to make most facial expressions. Characters used by professional VTubers, on the other hand, can deform the eyebrows, eyelids, irises, and mouth into various shapes. My goal in this article is to improve my system's expressiveness by increasing the types of movements it can produce.

2   Summary of Approach

My neural network system takes two inputs. The first is a head shot of a character looking straight at the viewer, and the second is a six-dimensional pose vector that specifies the pose the user wants the character to take. It outputs another image of the same character taking the specified pose. By varying the pose vector over time, the character can be animated. Because the pose vector is six-dimensional, the system can perform six types of movements. However, excluding rotating the face, it can only close its eyes and mouth.
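To make the interface concrete, here is a minimal sketch of driving the animation by varying the pose vector over time. The parameter layout in `POSE_NAMES`, the `animate` helper, and the identity stand-in for the network are all hypothetical illustrations, not the system's actual code:

```python
import numpy as np

# Hypothetical layout of the six-dimensional pose vector; the actual
# ordering and semantics in the system may differ.
POSE_NAMES = ["left_eye", "right_eye", "mouth", "head_x", "head_y", "neck_z"]

def animate(pose_fn, image, n_frames=30):
    """Animate a character by varying the pose vector over time.
    Here the driving signal is a toy one: a blink followed by the
    mouth opening."""
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        pose = np.zeros(len(POSE_NAMES))
        pose[0] = pose[1] = max(0.0, np.sin(2 * np.pi * t))  # close both eyes
        pose[2] = max(0.0, -np.sin(2 * np.pi * t))           # then open mouth
        frames.append(pose_fn(image, pose))
    return frames

# Stand-in for the neural network: identity posing, just for the sketch.
frames = animate(lambda img, pose: img, np.zeros((256, 256, 4)))
```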

The system poses a given character in two steps, each carried out by its own separate subnetwork. The face morpher closes the eyes and the mouth, and the face rotator rotates the face.

Figure 2.1 An overview of how the 2019 system poses a character's face. The character is Kizuna AI (© Kizuna AI).

To increase the types of movement, I started by preparing larger datasets. From the collection of approximately 8,000 3D models I gathered for my last system, I identified 39 common movements of facial parts and generated new datasets containing them. (You can see the list of movements here.) The movements encompass all four movable facial features (eyebrows, eyelids, irises, and mouth) that can be observed in industrial characters. The size of the pose vector increased from 6 to 42 as a result [footnote].

To deal with larger pose vectors, I propose a new architecture for the face morpher network, the overview of which is depicted in Figure 2.2.

Figure 2.2 An overview of the new face morpher architecture. It morphs the face in two steps: the first morphs the eyebrow, and the second morphs the eyes and the mouth. The character is Tokino Sora (© Tokino Sora Ch.).

The new face morpher has two subnetworks: the eyebrow morpher and the eye & mouth morpher, each deforming the facial feature(s) in its name. The pose vector is divided into parts that are fed into the relevant subnetworks.
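As a sketch, the split might look like the following. The dimension counts (`EYEBROW_DIMS`, `EYE_MOUTH_DIMS`, `ROTATION_DIMS`) and the `split_pose` helper are assumptions for illustration; the article does not specify the exact partition of the 42 parameters:

```python
import numpy as np

# Illustrative sizes only; the real partition of the 42-dimensional
# pose vector between eyebrow, eye & mouth, and face rotation
# parameters is not spelled out in this article.
EYEBROW_DIMS = 12
EYE_MOUTH_DIMS = 27
ROTATION_DIMS = 3

def split_pose(pose):
    """Divide the full pose vector into the parts consumed by the
    eyebrow morpher, the eye & mouth morpher, and the face rotator."""
    assert pose.shape == (EYEBROW_DIMS + EYE_MOUTH_DIMS + ROTATION_DIMS,)
    eyebrow = pose[:EYEBROW_DIMS]
    eye_mouth = pose[EYEBROW_DIMS:EYEBROW_DIMS + EYE_MOUTH_DIMS]
    rotation = pose[-ROTATION_DIMS:]
    return eyebrow, eye_mouth, rotation

brow, eye_mouth, rotation = split_pose(np.arange(42.0))
```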

2.1   Eyebrow Morpher

The eyebrow morpher first segments out the eyebrows with a dedicated subnetwork called the eyebrow segmenter. It then uses another subnetwork called the eyebrow warper to deform the eyebrows and composite the result back onto the original image.

Figure 2.3 An overview of the architecture of the eyebrow morpher.

The two networks have similar structures. Each contains an encoder-decoder network that turns the input image(s) and the (optional) pose vector into an intermediate feature representation, which is then used to perform several image manipulation steps. I employ three types of image manipulation, each encapsulated into a reusable neural network unit.

  1. Partial image change. The feature tensor is used to produce an alpha mask and another image that represents changes to the original image. The mask and the change image are then used to perform alpha blending with the input image to partially modify it. I take this step from the ECCV 2018 paper by Pumarola et al., which successfully applies it to alter facial expressions in human photos [2018].
  2. Combining. The feature tensor is used to produce an alpha mask, which is then used to combine two images through alpha blending.
  3. Warping. The feature tensor is transformed into an appearance flow, a map which tells, for each pixel of the output, which input image pixel to copy from [Zhou et al. 2016]. The appearance flow is then used to warp another image as it tells where each pixel should be moved to.
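The three units can be sketched in plain NumPy as follows. This is a simplified illustration with function names of my own choosing: in the real networks these operations act on tensors produced by the encoder-decoder, and the warp uses differentiable bilinear sampling rather than the nearest-neighbor lookup shown here:

```python
import numpy as np

def partial_image_change(image, change, alpha):
    """Alpha-blend a network-produced change image over the input so
    that only the regions where the mask is high are modified
    (the technique of Pumarola et al. 2018)."""
    return alpha * change + (1.0 - alpha) * image

def combine(image_a, image_b, alpha):
    """Combine two images through alpha blending with a
    network-produced mask."""
    return alpha * image_a + (1.0 - alpha) * image_b

def warp(image, flow):
    """Warp with an appearance flow: flow[y, x] holds the (src_y, src_x)
    coordinates that each output pixel should be copied from
    (Zhou et al. 2016). Nearest-neighbor sampling for brevity."""
    h, w = image.shape[:2]
    src_y = np.clip(np.rint(flow[..., 0]), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(flow[..., 1]), 0, w - 1).astype(int)
    return image[src_y, src_x]
```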

The eyebrow segmenter does its job with two partial image changes. The eyebrow warper deforms the extracted eyebrows with a warp and a partial image change; it then composites them back onto the face image. Their architectures are given in Figures 2.4 and 2.5.

Figure 2.4 Architecture of the eyebrow segmenter.

Figure 2.5 Architecture of the eyebrow warper.

During research, I discovered that it was very important to process the eyebrows separately from other facial features. Network architectures that used the same network to morph all facial features blurred the eyebrows after morphing them. By having separate networks morph the eyebrows after segmenting them out, I introduced a strong bias to preserve the eyebrow pixels, yielding crisp results.

Figure 2.6 The effect of using separate networks to segment, morph, and then composite the eyebrows as proposed above.

2.2   Eye & Mouth Morpher

The eye & mouth morpher has a similar architecture to the previous two networks. After passing the input image (the output of the eyebrow morpher) and the relevant part of the pose vector to an encoder-decoder network, it performs the following image manipulation steps:

  1. a warp to deform the mouth and the irises,
  2. a partial image change to retouch the output of the last step, and
  3. another partial image change to deform the eyelids.
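The three steps above can be sketched as a pipeline. The helper names and the idea of passing the network-produced tensors (the appearance flow, the retouch image, and the alpha masks) as plain arguments are illustrative simplifications; in the actual system they come from the encoder-decoder's feature representation:

```python
import numpy as np

def blend(image, change, alpha):
    # Alpha blending, used by both "partial image change" steps.
    return alpha * change + (1.0 - alpha) * image

def warp_nn(image, flow):
    # Appearance-flow warp; nearest-neighbor here for brevity,
    # bilinear sampling in the real (differentiable) network.
    h, w = image.shape[:2]
    ys = np.clip(np.rint(flow[..., 0]), 0, h - 1).astype(int)
    xs = np.clip(np.rint(flow[..., 1]), 0, w - 1).astype(int)
    return image[ys, xs]

def eye_mouth_morph(image, flow, retouch, retouch_alpha, eyelid, eyelid_alpha):
    warped = warp_nn(image, flow)                      # 1. warp mouth & irises
    retouched = blend(warped, retouch, retouch_alpha)  # 2. retouch the warp
    return blend(retouched, eyelid, eyelid_alpha)      # 3. deform the eyelids
```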

Figure 2.7 Architecture of the eye & mouth morpher.

The above rather complicated process was a result of my iterating on the architecture. The first warping step is required to preserve high-frequency details of the irises. If partial image change were used, iris patterns drawn by artists would be lost.

Figure 2.8 The effect of the first warping step of the eye & mouth morpher on the quality of the irises. The character is Weatheroid Airi (© Weathernews Inc.).

The last step is necessary to produce artifact-free closed eyelids. If I were to deform the eyelids together with the other facial features, they would be handled by the first warping step. I discovered that this led to small lines near the eyes being smeared, leaving blemishes on the eyelids.

Figure 2.9 The effect of processing the eyelids with a partial image change in a separate step. Notice the blemish produced by the architecture with warping. It is the result of the network's dragging the eyelid down and thereby smearing the small line inside the ground truth's red box. My proposed architecture simply fills the space with a solid color, yielding an artifact-free image. The character is Yamato Iori (© Appland, Inc.).

3   Results

I applied my system to 200 images of VTubers and related characters to generate a short video clip for each, and I put all the videos together in the eyecatcher. You can watch the individual videos in the figure below.

Figure 3.1 Videos of characters being animated by my system.

Below is a selection of characters making the 7 facial expressions shown at the beginning of the article.

Figure 3.2 Facial expressions generated by my system. [copyright]

The above figure demonstrates the versatility of my system. It can handle both male and female characters with very different eye and face shapes. It sensibly deforms the eyes even when they are partially occluded by hair or seen through translucent glasses. It also hallucinates plausible mouth shapes when the input image has a closed mouth, whereas my previous system would just leave the mouth as is.

Another strength of my system is its flexibility: I can combine it with any source of pose parameters. I thus used it to create a number of content creation tools and fan videos.

First, I created a desktop application that allows the user to manipulate an anime character's facial expression and face rotation by dragging sliders. The resulting image can be saved for later use.

Second, I wrote a program that converts motions authored for 3D models into sequences of pose parameters, allowing me to use them to drive 2D character illustrations. With this tool, I created 4 music videos.

TikTok's "Wink"

Alien Alien

Ochame Kinou

Otome Kaibou

Lastly, I created another tool that allows me to control a character in real time. I use an iOS application called iFacialMocap to capture facial performance. The app uses the iPhone's depth camera and Apple's ARKit library to estimate around 50 blendshape parameters and streams them to a PC through a UDP connection. I wrote a receiver that converts the signals to pose vectors and feeds them to my neural network system in real time, allowing me to have anime characters mimic my facial motion.
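A receiver along these lines can be sketched as below. The text-based `name-value|name-value` wire format, the port number, and the `map_to_pose` callback are assumptions for illustration; consult iFacialMocap's documentation for the actual protocol:

```python
import socket

def parse_blendshapes(payload: str) -> dict:
    """Parse a hypothetical 'name-value|name-value' datagram into a
    dictionary of blendshape weights. (Real payloads, including ones
    with negative head-rotation values, may need a sturdier parser.)"""
    weights = {}
    for item in payload.split("|"):
        if "-" in item:
            name, _, value = item.rpartition("-")
            try:
                weights[name] = float(value)
            except ValueError:
                pass  # skip malformed entries
    return weights

def receive_loop(map_to_pose, port=49983):
    """Listen for UDP datagrams and yield pose vectors. map_to_pose is
    the (hypothetical) conversion from ~50 ARKit blendshape weights to
    the 42-dimensional pose vector the network expects."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(8192)
        yield map_to_pose(parse_blendshapes(data.decode("utf-8")))
```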

I also recorded myself saying a tongue twister, reading aloud a piece of Japanese text, and lip syncing three pieces of music. I transferred the motions to various characters to create the following videos.

Reciting the Medicine Peddler's Sales Pitch (外郎売)

Reading aloud Akutagawa Ryuunosuke's "Rail Truck"

Lip syncing Happy Birthday Class Rep Song (委員長おめでとうの歌)

Lip syncing GOMIKASU-Original Mix-

Lip syncing Baka Mitai (the song used in the Dame Da Ne meme)

I hope I have convinced you through the videos that the characters lip sync and imitate my facial movements well. Moreover, my system preserves the beauty of the illustrations because it does not drastically distort the faces, unlike the First Order Motion Model [Siarohin et al. 2019], which is widely used to generate meme videos such as the one shown below.

[Embedded tweet by Cakewalking 5555+1 (@Tortokhod), July 17, 2020: "you can have it I just got even better idea" pic.twitter.com/LnCkMZK51K]

4   Conclusion

I have presented a new network architecture for changing the facial expression of images of anime characters. It is capable of deforming the eyebrows, the eyelids, the irises, and the mouth, all of which are facial features important for conveying emotions. It produces good lip syncs and allows characters to express various emotions. It is a clear improvement over my 2019 system, which can only close the eyes and mouth. The whole system can be combined with any source of pose vectors, allowing me to easily create content and tools as shown in the last section.

A major limitation of my approach is that the possible movements are limited to the common blendshapes found in 3D models. Hence, it is not yet possible to have anime characters imitate all types of human facial movements.

This project was born out of my desire to make the 2019 system more practical, and it successfully solves the lack of facial expressiveness. However, the whole system still has many problems. The model has become much bigger (from 360MB to 600MB) and thus slower, disoccluded parts after face rotation could use improvement, and many restrictions still exist on the input image. I will address these shortcomings in future projects.

Lastly, note that this article elides many details for brevity. Curious readers can find more information in the full (and much longer) write-up. Among other things, it includes detailed literature review, rationales for the network architectures, and comparisons with other previous works.


While I am an employee of Google Japan, this project is my personal hobby, which I did in my free time without using Google's resources. It has nothing to do with my work, as I am a normal software engineer writing Google Maps backends for a living. While I did computer graphics research in my previous life, I currently do not belong to any of Google's or Alphabet's research organizations. Opinions expressed in this article are my own and not the company's. Google, though, may claim rights to the article's technical inventions.

Special Thanks

I would like to thank Yanghua Jin, Alice Maruyama, Cory Li, Shinji Ogaki, Panupong Pasupat, Jamorn Sriwasansak, Yingtao Tian, Mamoru Uehara, and Jayakorn Vongkulbhisal for their comments.

Update History

Project Bougainvillea