Flag #15 - Image Recognition for Anime Characters

Iskandar Setiadi

Note: This post is a continuation of Flag #8 - First Experimentation to Image Processing with TensorFlow, which was written a year ago.

Prologue

Moving forward, we understand more and more that image recognition is hard, even for humans. In an experiment, Andrej Karpathy could only achieve a 5.1% top-5 error rate on the ILSVRC dataset. In the ILSVRC 2015 paper (Russakovsky, O., et al.), the following image shows several challenges in image classification.

Last year, I decided to try TensorFlow after proposing a problem: "Is there a way to recognize 2D or anime characters?" From that point onward, I have spent several weekends with several booming technologies, which I want to share here :)

2D Character Equals 3D Human

My initial naive idea was to treat a 2D anime character as equal to a 3D human. This was proven wrong later on (~~2D > 3D~~), and you will learn why it simply does not work by the end of this section. To start with, I tried using the OpenFace library to classify my training images. I prepared 3 characters with 40 images each, which is a very small number by image recognition standards.

How OpenFace works (Ref: https://raw.githubusercontent.com/cmusatyalab/openface/master/images/summary.jpg)

First, OpenFace tries to detect faces with pre-trained models from dlib or OpenCV. This can be done with either an LBP cascade or a Haar cascade. After that, it transforms the face with dlib's real-time pose estimation and OpenCV's affine transformation so that facial features such as the eyes and lips land in consistent positions. Then, it resizes the image to fixed pixel dimensions before applying a deep neural network to it. The reason why it fails: the facial features of 2D anime characters are different from those of human beings. To start with, 2D anime characters usually don't have a nose :p
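To make the alignment step concrete, here is a minimal sketch of the affine transformation in plain Python. It is not OpenFace's actual code: the landmark coordinates and the template positions are hypothetical values chosen for illustration, and in practice OpenCV's `cv2.getAffineTransform` would do this job. The point is that the transform needs three detected landmarks to exist at all, which is exactly what goes missing on anime faces.

```python
def solve_affine(src, dst):
    """Return a 2x3 affine matrix mapping three src points onto three dst points."""
    (x1, y1), (x2, y2), (x3, y3) = src
    # Determinant of [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]]
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("landmarks are collinear; the affine transform is undefined")
    rows = []
    for k in (0, 1):  # k=0 solves the output x row, k=1 the output y row
        u1, u2, u3 = dst[0][k], dst[1][k], dst[2][k]
        # Cramer's rule for a * x + b * y + c = u
        a = (u1 * (y2 - y3) - y1 * (u2 - u3) + (u2 * y3 - u3 * y2)) / det
        b = (x1 * (u2 - u3) - u1 * (x2 - x3) + (x2 * u3 - x3 * u2)) / det
        c = (x1 * (y2 * u3 - y3 * u2) - y1 * (x2 * u3 - x3 * u2)
             + u1 * (x2 * y3 - x3 * y2)) / det
        rows.append((a, b, c))
    return rows


def apply_affine(matrix, point):
    """Map a single (x, y) point through the 2x3 affine matrix."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


# Hypothetical detected landmarks: left eye, right eye, mouth center
detected = [(30.0, 40.0), (70.0, 38.0), (50.0, 80.0)]
# Hypothetical template positions inside a 96x96 aligned face
template = [(29.0, 35.0), (67.0, 35.0), (48.0, 75.0)]

alignment = solve_affine(detected, template)
```

Applying `apply_affine(alignment, detected[0])` lands on the first template position, and the same matrix would then be used to warp every pixel of the face.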

68 face landmarks in human beings (Ref: https://www.pyimagesearch.com/wp-content/uploads/2017/04/faciallandmarks68markup-768x619.jpg)

At this point, it is clear where it goes wrong: the pre-trained models above do not detect any faces. Fortunately, nagadomi, the creator of waifu2x, has created a face detector for anime characters using OpenCV, which is based on an LBP cascade. I compared it with OpenFace in https://github.com/freedomofkeima/opencv-playground, and the result is quite satisfying.
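A rough sketch of how such a cascade can be used from Python follows. It assumes OpenCV's Python bindings are installed and that nagadomi's cascade file has been downloaded locally as `lbpcascade_animeface.xml`. The `largest_face` helper, the choice to keep only the largest detection, and the 96-pixel output size are my own assumptions for illustration, not something taken from the playground repository.

```python
def largest_face(boxes):
    """Pick the (x, y, w, h) box with the largest area, or None if there is none."""
    boxes = list(boxes)
    if not boxes:
        return None
    return max(boxes, key=lambda box: box[2] * box[3])


def detect_and_crop(image_path, cascade_path="lbpcascade_animeface.xml", size=96):
    """Detect anime faces, then crop and resize the largest one.

    Returns a size x size BGR array, or None when no face is found.
    """
    import cv2  # imported lazily so largest_face works without OpenCV installed

    cascade = cv2.CascadeClassifier(cascade_path)
    if cascade.empty():
        raise IOError("could not load cascade file: " + cascade_path)
    image = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if image is None:
        raise IOError("could not read image: " + image_path)
    # The cascade works on an equalized grayscale image
    gray = cv2.equalizeHist(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))
    boxes = cascade.detectMultiScale(
        gray, scaleFactor=1.1, minNeighbors=5, minSize=(24, 24))
    box = largest_face(boxes)
    if box is None:
        return None
    x, y, w, h = box
    return cv2.resize(image[y:y + h, x:x + w], (size, size))
```

Calling `detect_and_crop("some_image.png")` (a placeholder file name) would return the cropped face, ready to be saved as a training image.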

With OpenFace -- no face is detected (top); With Nagadomi's LBP Cascade (bottom)

The Dark Age

Here came the time when I struggled with various ideas. After correctly detecting faces of 2D characters, what can we do next? I read some scientific papers such as "Face Detection and Face Recognition of Cartoon Characters Using Feature Extraction", written by Takayama, K., et al. in 2012. They extracted several features such as hair color, skin color, and hair quantity. Their proposed method achieved 74.2% true positives, 14.0% false positives, and 11.8% false negatives.

However, feature extraction is very painful since each feature needs a different extraction strategy, and I absolutely cannot finish it in my weekend time. In addition, hair color, eye size, eye color, etc. are totally dependent on the artist's style.

The other idea is to apply the IQDB strategy of searching for source images: checking image similarity between the input image and its database, since 2D art is "limited" (around 10M in total) and there are only small variations to it. However, this doesn't work with slight modifications to the image, such as cropping. For example, an original image returned a similarity score of 98% from IQDB, while a 20% cropped image only returned 43%, and with more cropping than that, the search failed.
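To illustrate why cropping hurts this kind of lookup, here is a toy sketch. It does not reproduce IQDB's actual algorithm; it swaps in a much simpler average-hash style signature on a synthetic grayscale image, purely as an assumption for illustration. Any signature computed over the whole frame shifts when part of the frame is removed.

```python
def sample_grid(image, size=4):
    """Downsample a 2D grayscale image (list of rows) to size x size by nearest sampling."""
    height, width = len(image), len(image[0])
    rows = [int((i + 0.5) * height / size) for i in range(size)]
    cols = [int((j + 0.5) * width / size) for j in range(size)]
    return [[image[r][c] for c in cols] for r in rows]


def signature(image, size=4):
    """Bit signature: 1 where a sampled pixel is brighter than the grid mean."""
    flat = [value for row in sample_grid(image, size) for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def similarity(image_a, image_b, size=4):
    """Fraction of signature bits that agree between two images."""
    sig_a, sig_b = signature(image_a, size), signature(image_b, size)
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def crop_left(image, fraction):
    """Remove the leftmost fraction of columns from the image."""
    cut = int(len(image[0]) * fraction)
    return [row[cut:] for row in image]


# Synthetic 10x10 image: bright left half, dark right half
original = [[255 if col < 5 else 0 for col in range(10)] for _ in range(10)]
cropped = crop_left(original, 0.2)

print(similarity(original, original))
print(similarity(original, cropped))
```

On this toy image, the score against itself is 1.0, while the 20% crop drops to 0.75 even though the content is unchanged. Real images and real signatures are far richer, but the failure mode is the same.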

Deep Learning & Transfer Learning to the Rescue

The most satisfying result that I have encountered so far comes from transfer learning. First of all, the challenge is obvious: some animation characters have very little data available since the character is new or unpopular.

Transfer Learning Diagram. Ref: bottom image here

At this point, we can consider each 2D character's face as an "object". Inception-v3, a deep learning model based on an ILSVRC winner, is already trained on over a million images and battle-tested for image classification. If you are interested in applying transfer learning to your dataset, you can learn more about it here. The following experiment is done with TensorFlow.
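The core idea of this kind of transfer learning can be sketched without TensorFlow: keep the pre-trained network frozen, take its output features for each image, and train only a new final softmax layer on top. The sketch below is plain Python and is not the code used in the experiment. The toy 3-dimensional vectors are stand-ins I made up for the real feature vectors a frozen network would produce, and the learning rate and epoch count are arbitrary assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, x):
    """Class probabilities for one frozen feature vector."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(logits)


def train_softmax_head(features, labels, num_classes, epochs=200, lr=0.5, seed=0):
    """Train only a final softmax layer with SGD on precomputed feature vectors."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[rng.uniform(-0.01, 0.01) for _ in range(dim)]
               for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = predict(weights, biases, x)
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the logit of class c
                grad = probs[c] - (1.0 if c == y else 0.0)
                for i in range(dim):
                    weights[c][i] -= lr * grad * x[i]
                biases[c] -= lr * grad
    return weights, biases


# Hypothetical frozen features for 3 characters, 2 images each
features = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2],
            [0.1, 1.0, 0.0], [0.0, 0.8, 0.1],
            [0.0, 0.1, 1.0], [0.2, 0.0, 0.9]]
labels = [0, 0, 1, 1, 2, 2]

weights, biases = train_softmax_head(features, labels, num_classes=3)
```

Because only the small final layer is trained, a few dozen images per character can be enough, which is what makes the approach attractive for new or unpopular characters.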

For the test data, I used the same test data as in all previous experiments above: 7 images which should be assigned to a proper category and 3 images whose characters are not available in the existing categories. Surprisingly, 8 out of 10 are categorized properly with a > 0.94 threshold, 1 out of 10 is a false positive, and another 1 out of 10 is a false negative. Normally, the other experiments only resulted in 4 out of 10 correct categorizations with a poorly separated threshold, but with transfer learning, the threshold is clear and the accuracy is very good. Since the training data is very small, this experiment only compares the top-1 error rate, although we could actually show several probabilities (e.g., top-5 similarity).
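The thresholding logic can be sketched as follows. This is an illustration rather than the exact evaluation script: the character names are hypothetical, and I am assuming that a false positive means "a label was returned but it was wrong" and a false negative means "no label was returned for a known character".

```python
def top1_with_threshold(probabilities, threshold=0.94):
    """Return the top-1 label if its score clears the threshold, else None (unknown)."""
    label, score = max(probabilities.items(), key=lambda item: item[1])
    return label if score > threshold else None


def tally(predictions, truths):
    """Count outcomes; a truth of None marks a character outside the trained categories."""
    counts = {"correct": 0, "false_positive": 0, "false_negative": 0}
    for predicted, truth in zip(predictions, truths):
        if predicted == truth:
            counts["correct"] += 1
        elif predicted is None:
            counts["false_negative"] += 1
        else:
            counts["false_positive"] += 1
    return counts


# Hypothetical softmax outputs for two test images
confident = {"character_a": 0.97, "character_b": 0.02, "character_c": 0.01}
uncertain = {"character_a": 0.60, "character_b": 0.30, "character_c": 0.10}

print(top1_with_threshold(confident))
print(top1_with_threshold(uncertain))
```

The first call returns `character_a`, and the second returns `None` because no score clears 0.94, so the image is treated as an unknown character.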

Test results (top); Failure because of similar image style to existing character (bottom)

Afterwards, I tried to use transfer learning directly (same images) without the LBP cascade & resizing pre-processing. As hypothesized, it simply does not work, since the relevant features in a full image are too sparse.

Remarks

It has been an interesting journey and it's still far from completion. From this point, it's time to scale up the number of training sets and see how the result will unfold. If the result is quite promising, there are several ideas to expand: create a website to recognize an uploaded image, create a website to assist data set labeling, auto-sort downloaded 2D anime character images, and so on. Thank you for reading!

Update (November 11, 2017): More results are out! https://github.com/freedomofkeima/transfer-learning-anime

Update (December 3, 2017): Check https://freedomofkeima.com/moeflow/ for a quick demonstration with 100 classes, trained with < 50 images per category. Currently, it has an accuracy of 70%.
