Abstract. I extended the animation-from-a-single-image neural network system I created in 2019 so that characters can make more types of facial expressions. While the old system could only open and close the eyes and the mouth, the new version affords more eye and mouth shapes and can control the eyebrows and the irises. These additions allow a character to show various emotions and give a more convincing impression of speech.
|Disgusted||Condescending||Uwamedukai [footnote]||Gangimari-Gao [footnote]|
The character is Otogibara Era (© Ichikara Inc.).
Uwamedukai (上目遣い) is Japanese for the pose where a shorter person looks at another taller one with upturned eyes while tilting the face down. See this link for more examples.
Gangimari-Gao (ガンギマリ顔) is a facial expression in which a character glares at the viewer with the eyes wide open and the irises reduced in size while smiling. The disconcerting, if not borderline insane, look gives the impression that the character is high on drugs (キマっている). The expression was popularized by the virtual YouTuber Tsunomaki Watame. See her in action here.
With the new network, I can drive character illustrations with motions authored for 3D models.
I also created a real-time motion transfer tool that provides more controls over the character's face.
I modified the tool to record my motion, and I was later able to make multiple characters talk and sing with more dynamic lip and face movements.
With the goal of making it easier to become a virtual YouTuber (VTuber), in 2019, I created a neural network system that can animate the face of any existing anime character, given only an image of it. The system, however, cannot yet be considered practical for becoming a VTuber. The most important shortcoming is that it can only close the eyes and the mouth, robbing the character of the ability to make most facial expressions. Characters used by professional VTubers, on the other hand, can deform the eyebrows, eyelids, irises, and mouth into various shapes. My goal in this article is to improve my system's expressiveness by increasing the types of movements it can produce.
My neural network system takes two inputs. The first is a head shot of a character looking straight at the viewer, and the second is a six-dimensional pose vector that specifies the pose the user wants the character to take. It outputs another image of the same character taking the specified pose. By varying the pose vector over time, the character can be animated. Because the pose vector is six-dimensional, the system can perform six types of movements; however, excluding rotating the face, it can only close its eyes and mouth.
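Concretely, animation under this scheme is nothing more than a time-varying pose vector. A minimal sketch is below; the parameter layout, the 0-to-1 value convention, and the specific motion curves are illustrative assumptions, not the system's exact interface.

```python
import math

def pose_at(t: float) -> list:
    """Hypothetical 6-dim pose vector at time t (seconds).

    Assumed layout: [left_eye, right_eye, mouth, head_x, head_y, head_z],
    with expression parameters in [0, 1] (1 = fully closed) and
    rotation parameters left at 0 (neutral) here.
    """
    blink = 0.5 - 0.5 * math.cos(2 * math.pi * t)   # slow periodic eye close
    talk = 0.5 + 0.5 * math.sin(6 * math.pi * t)    # faster mouth motion
    return [blink, blink, talk, 0.0, 0.0, 0.0]

# Feeding pose_at(t) for successive t to the network yields an animation.
frames = [pose_at(i / 30) for i in range(60)]  # two seconds at 30 fps
```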
The system poses a given character in two steps, each carried out by its own separate subnetwork. The face morpher closes the eyes and the mouth, and the face rotator rotates the face.
|Figure 2.1 An overview of how the 2019 system poses a character's face. The character is Kizuna AI (© Kizuna AI).|
To increase the types of movement, I started by preparing larger datasets. From the approximately 8,000 3D models I collected for my last system, I identified 39 common movements of facial parts and generated new datasets containing them. (You can see the list of movements here.) The movements encompass all four movable facial features (eyebrows, eyelids, irises, and mouth) that can be observed in characters produced by the industry. As a result, the size of the pose vector increased from 6 to 42 [footnote].
Previously, the pose vector had 6 dimensions, 3 of which were used to control facial expression. With the new types of movements incorporated, 39 dimensions are needed instead, making the total length $39 + 3 = 42$.
To deal with larger pose vectors, I propose a new architecture for the face morpher network, the overview of which is depicted in Figure 2.2.
|Figure 2.2 An overview of the new face morpher architecture. It morphs the face in two steps: the first morphs the eyebrow, and the second morphs the eyes and the mouth. The character is Tokino Sora (© Tokino Sora Ch.).|
The new face morpher has two subnetworks: the eyebrow morpher and the eye & mouth morpher, each deforming the facial feature(s) in its name. The pose vector is divided into parts that are fed to the relevant subnetworks.
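In code, routing the pose vector to the subnetworks amounts to simple slicing. The sketch below assumes the layout from the footnote (39 expression parameters followed by 3 rotation parameters); the boundary between eyebrow and eye & mouth parameters is a made-up placeholder, not the system's actual split.

```python
import numpy as np

# Hypothetical layout of the 42-dim pose vector: the first 39 entries are
# expression parameters, the last 3 are face-rotation parameters. Within
# the expression block, assume (for illustration only) that the eyebrow
# movements come first.
NUM_EYEBROW_PARAMS = 12  # placeholder value, not the system's actual count

def split_pose(pose):
    """Split a 42-dim pose vector into the pieces each subnetwork consumes."""
    assert pose.shape == (42,)
    expression, rotation = pose[:39], pose[39:]
    eyebrow_pose = expression[:NUM_EYEBROW_PARAMS]    # -> eyebrow morpher
    eye_mouth_pose = expression[NUM_EYEBROW_PARAMS:]  # -> eye & mouth morpher
    return eyebrow_pose, eye_mouth_pose, rotation     # rotation -> face rotator

brow, eye_mouth, rot = split_pose(np.zeros(42))
```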
The eyebrow morpher first segments out the eyebrows with a dedicated subnetwork called the eyebrow segmenter. It then uses another subnetwork, the eyebrow warper, to deform the eyebrows and composite the result back into the original image.
|Figure 2.3 An overview of the architecture of the eyebrow morpher.|
The two networks have similar structures. Each contains an encoder-decoder network that turns the input image(s) and the (optional) pose vector into an intermediate feature representation, which is then used to perform several image manipulation steps. I employ three types of image manipulation, each encapsulated in a reusable neural network unit.
The eyebrow segmenter does its job with two partial image changes. The eyebrow warper deforms the extracted eyebrows with a warp and a partial image change, then combines them back into the face image. Their architectures are given in Figures 2.4 and 2.5.
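Two of these units can be described with small formulas. A partial image change blends a generated change image into the input through a predicted alpha mask, and a warp moves input pixels according to a predicted offset field. The numpy sketch below is my paraphrase of the unit names in the article, not verified implementation details; in particular, a real warp unit would bilinearly sample a sub-pixel flow field, while this sketch uses integer offsets to stay short.

```python
import numpy as np

def partial_image_change(image, change, alpha):
    """Replace only the pixels the network wants to edit.

    image, change: (H, W, C) float arrays; alpha: (H, W, 1) mask in [0, 1].
    Where alpha is 0 the input pixel is kept untouched, which biases the
    network toward preserving the original pixels.
    """
    return alpha * change + (1.0 - alpha) * image

def warp(image, flow):
    """Move pixels according to a per-pixel source offset field.

    flow: (H, W, 2) array of (dy, dx) offsets; nearest-neighbor sampling
    with border clipping keeps this sketch simple.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + flow[..., 0].astype(int), 0, h - 1)
    src_x = np.clip(xs + flow[..., 1].astype(int), 0, w - 1)
    return image[src_y, src_x]
```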
During research, I discovered that it was very important to process the eyebrows separately from the other facial features. Architectures that morphed all facial features with a single network blurred the eyebrows after morphing them. By having separate networks morph the eyebrows after segmenting them out, I introduced a strong bias to preserve the eyebrow pixels, yielding crisp results.
The eye & mouth morpher has a similar architecture to the previous two networks. After passing the input image (the output of the eyebrow morpher) and the relevant part of the pose vector to an encoder-decoder network, it performs the following image manipulation steps:
The rather complicated process above is the result of iterating on the architecture. The first warping step is required to preserve high-frequency details of the irises: if a partial image change were used instead, iris patterns drawn by artists would be lost.
The last step is necessary to produce artifact-free closed eyelids. If the eyelids were deformed together with the other facial features, they would be covered by the first warping step, and I discovered that this led to small lines near the eyes being smeared, blemishing the eyelids as a result.
I applied my system to 200 images of VTubers and related characters to generate a short video clip for each, and I put all the videos together in the eyecatcher. You can watch the individual videos in the figure below.
|Image being animated||Video|
Below is a selection of characters making the 7 facial expressions shown at the beginning of the article.
From top to bottom, the characters are:
The above figure demonstrates the versatility of my system. It handled both male and female characters with very different eye and face shapes. It sensibly deformed the eyes even when they were partially occluded by hair or seen through translucent glasses. It also hallucinated plausible mouth shapes when the input image had a closed mouth, whereas my previous system would just leave the mouth as is.
Another strength of my system is its flexibility: it can be combined with any source of pose parameters. I thus used it to create a number of content creation tools and fanvids.
First, I created a desktop application that allows the user to manipulate an anime character's facial expression and face rotation by dragging sliders. The resulting image can be saved for later use.
Second, I wrote a program that converts motions authored for 3D models into sequences of pose parameters, allowing me to use them to drive 2D character illustrations. With this tool, I created 4 music videos.
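At its core, this conversion is a resampling problem: a motion file stores keyframes of named blendshape weights, and each output frame needs a full pose vector. A minimal sketch of the interpolation step is below; the track and movement names are illustrative, and a real motion format would need its own parser.

```python
import bisect

def sample_track(keyframes, t):
    """Linearly interpolate a single blendshape track at time t (seconds).

    keyframes: list of (time, value) pairs sorted by time.
    """
    times = [k[0] for k in keyframes]
    i = bisect.bisect_right(times, t)
    if i == 0:
        return keyframes[0][1]       # before the first keyframe
    if i == len(keyframes):
        return keyframes[-1][1]      # after the last keyframe
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def motion_to_pose(tracks, order, t):
    """Assemble a pose vector at time t from named tracks.

    tracks: dict mapping movement name -> keyframe list.
    order:  list of movement names defining the pose-vector layout.
    Missing tracks default to 0 (the neutral pose).
    """
    return [sample_track(tracks[name], t) if name in tracks else 0.0
            for name in order]
```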
Lastly, I created another tool that allows me to control a character in real time. I use an iOS application called iFacialMocap to capture facial performance. The app uses the iPhone's depth camera and Apple's ARKit library to estimate around 50 blendshape parameters and streams them to a PC through a UDP connection. I wrote a receiver that converts the signals to pose vectors and feeds them to my neural network system in real time, allowing me to have anime characters mimic my facial motion.
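The receiver itself is a small UDP loop plus a parser and a remapping table. The packet format assumed below (pipe-separated `name-value` fields with values in 0..100) should be checked against iFacialMocap's documentation, and the blendshape-to-pose mapping is a deliberately truncated placeholder.

```python
import socket

def parse_packet(payload):
    """Parse an assumed 'name-value|name-value|...' packet into floats in [0, 1]."""
    params = {}
    for field in payload.split("|"):
        name, _, raw = field.rpartition("-")
        if name and raw:
            try:
                params[name] = float(raw) / 100.0
            except ValueError:
                pass  # skip malformed fields
    return params

def to_pose(params):
    """Map ARKit-style blendshape weights to a (truncated) pose vector.

    Only a few entries are covered here for illustration; a real mapping
    would fill all 42 pose dimensions.
    """
    return [
        params.get("eyeBlink_L", 0.0),
        params.get("eyeBlink_R", 0.0),
        params.get("jawOpen", 0.0),
    ]

def receive_loop(port=49983):  # placeholder port number
    """Yield one pose vector per received UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _ = sock.recvfrom(8192)
        yield to_pose(parse_packet(data.decode("utf-8", errors="ignore")))
```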
I also recorded myself saying a tongue twister, reading aloud a piece of Japanese text, and lip syncing three pieces of music. I transferred the motions to various characters to create the following videos.
Reciting the Medicine Peddler's Sales Pitch (外郎売)
Reading aloud Akutagawa Ryuunosuke's "Rail Truck"
Lip syncing Happy Birthday Class Rep Song (委員長おめでとうの歌)
Lip syncing GOMIKASU-Original Mix-
I hope the videos have convinced you that the characters lip sync and imitate my facial movements well. Moreover, my system preserves the beauty of the illustrations: it does not drastically distort the faces, unlike the First Order Motion Model [Siarohin et al. 2019], which is widely used to generate meme videos such as the one shown below.
you can have it I just got even better idea pic.twitter.com/LnCkMZK51K— Cakewalking 5555+1 (@Tortokhod) July 17, 2020
I have presented a new network architecture for changing the facial expressions of anime character images. It is capable of deforming the eyebrows, the eyelids, the irises, and the mouth, all of which are facial features important for conveying emotions. It produces good lip syncs and allows characters to express various emotions, a clear improvement over my 2019 system, which can only close the eyes and the mouth. The whole system can be combined with any source of pose vectors, allowing me to easily create content and tools as shown in the last section.
A major limitation of my approach is that the possible movements are limited to the common blendshapes found in 3D models. Hence, it is not yet possible to have anime characters imitate all types of human facial movements.
This project was born out of my desire to make the 2019 system more practical, and it successfully solves the lack of facial expressiveness. However, the whole system still has many problems. The model has become much bigger (from 360MB to 600MB) and thus slower, disoccluded parts after face rotation could use improvement, and many restrictions still exist on the input image. I will address these shortcomings in future projects.
Lastly, note that this article elides many details for brevity. Curious readers can find more information in the full (and much longer) write-up. Among other things, it includes a detailed literature review, rationales for the network architectures, and comparisons with previous works.
While I'm an employee of Google Japan, this project is my personal hobby, which I did in my free time without using Google's resources. It has nothing to do with work, as I am a normal software engineer writing Google Maps backends for a living. While I did computer graphics research in my previous life, I currently do not belong to any of Google's or Alphabet's research organizations. Opinions expressed in this article are my own and not the company's. Google, though, may claim rights to the article's technical inventions.