In the past few months, we have seen several exciting announcements from both Microsoft Build and Google I/O 2018. At these events, Microsoft announced its Custom Vision service, while Google announced its Cloud AutoML service, with AutoML Vision as the first product to be released. At the time of writing, I still haven't been able to get access to Google Cloud AutoML, but fortunately Microsoft Custom Vision is available in open preview. As far as I know, Amazon doesn't have a direct counterpart: the closest one is Amazon Rekognition, but it doesn't support custom training. While these services are quite new, the concept itself has existed for several years. For example, Clarifai, an image-recognition company founded in 2013, already offers a similar service.
Now, let's get back to the main business. If you want to compare the image labeling capabilities of those vendors, goberoi has written a very nice blog post with several examples here. What we want to do here is a cute comparison between those services. Cute here means ... *drum rolls* ... an anime character recognition comparison!
Caveat: I am planning to update this post with Google Cloud AutoML results once it is ready for open preview :)
Initially I wanted to go all-out with a 100-character comparison, but the free version of Clarifai has a limit of 10 "Custom Concepts" (simpler term: tags) and 5000 free operations, while Azure Custom Vision Preview has a limit of 50 tags and 5000 training images. The lowest limit here is 10 "Custom Concepts", which means I can only try up to 10 characters for free. For each character, I am using around 35 to 45 images. In addition, these images are pre-processed the same way as in MoeFlow, since running classification directly, without first localizing the character with object detection, simply doesn't work.
In total, there are 10 categories and 407 images. At the end of the experiment, ~1k operations were billed in Clarifai.
During project initialization in Azure Custom Vision, you will be prompted to choose between classification and object detection as the project type. In this experiment, we choose "Classification" as the project type and "General" as the domain.
Afterwards, I only need to upload all of my images with the proper tags, click the "Train" button, and voila, my model is ready to go!
While the user interface is simple, it doesn't give you much additional information in case you want to do further analysis. Clicking on a tag URL shows the predictions for each training image that we supplied beforehand.
Images with a red box are below the specified threshold. Since we are interested in top-1 accuracy, the probability threshold doesn't really give us the necessary information: there are cases where an image falls below the threshold while its correct tag is still the top-1 prediction. For example, in the screenshot above, "alice_cartelet" is still shown as the top-1 prediction, with a probability of 34%. Since Azure Custom Vision treats every input as an open-domain problem, it makes sense that some of the images could not be classified properly (under the probability threshold). However, it would be nice if they provided more flexibility to users (as Clarifai does; see the next section).
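The difference between the threshold view and top-1 accuracy can be sketched in a few lines of Python. This is a minimal illustration, not Azure Custom Vision's API: the score dictionaries, the 0.5 threshold, and the helper names are all made up for the example (only the 34% figure for alice_cartelet comes from the screenshot).

```python
def top1(scores):
    """Return the (tag, probability) pair with the highest probability."""
    return max(scores.items(), key=lambda kv: kv[1])


def top1_accuracy(predictions, labels):
    """Fraction of images whose highest-scoring tag equals the true label."""
    correct = sum(
        1 for scores, label in zip(predictions, labels) if top1(scores)[0] == label
    )
    return correct / len(labels)


def passes_threshold(scores, label, threshold):
    """True if the true label's probability reaches the UI threshold."""
    return scores.get(label, 0.0) >= threshold


# Hypothetical per-image scores; values other than 0.34 are invented.
predictions = [
    {"alice_cartelet": 0.34, "asuna": 0.10, "aragaki_ayase": 0.05},
    {"alice_cartelet": 0.08, "asuna": 0.91, "aragaki_ayase": 0.02},
]
labels = ["alice_cartelet", "asuna"]
threshold = 0.5  # assumed value, for illustration only

for scores, label in zip(predictions, labels):
    tag, prob = top1(scores)
    flagged = not passes_threshold(scores, label, threshold)
    print(f"true={label} top1={tag} ({prob:.2f}) red_box={flagged}")

print("top-1 accuracy:", top1_accuracy(predictions, labels))
```

The first image would get a red box under this threshold even though its top-1 tag is correct, which is why the numbers in this post are computed from the ranking rather than from the threshold.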
Clarifai's UI is a bit less intuitive; I missed several features on my first try. For example, there are two interesting toggles in the right panel: "Concepts Mutually Exclusive" and "Closed Environment". From the documentation:
When interpreting the evaluation results, keep in mind the nature of your model. Specifically, pay attention to whether or not you have labeled the inputs with more than 1 concept (i.e. non-mutually exclusive concepts environment), vs. only 1 concept per image.
Since each character is categorized under a single specific concept, "Concepts Mutually Exclusive" is turned on.
Afterwards, we can start training our model by clicking the "Train Mo..." button. After a minute or so, the model is ready to go and you can click the "View" button to get more details on training performance.
You will get a nice-looking table, where you can adjust the "Prediction Threshold" to see the recall and precision for each concept.
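To make the effect of the "Prediction Threshold" slider concrete, here is a small sketch using the standard definitions of precision and recall for a single concept. How Clarifai computes its table internally is not covered in this post, so treat this as an assumption; the scores and the function name are hypothetical.

```python
def precision_recall(scores, labels, concept, threshold):
    """Precision and recall for one concept at a given score threshold."""
    tp = fp = fn = 0
    for image_scores, label in zip(scores, labels):
        predicted = image_scores.get(concept, 0.0) >= threshold
        actual = label == concept
        if predicted and actual:
            tp += 1  # correctly tagged with the concept
        elif predicted and not actual:
            fp += 1  # tagged with the concept, but it is another character
        elif actual:
            fn += 1  # the concept was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores for the "asuna" concept on four images.
scores = [{"asuna": 0.9}, {"asuna": 0.6}, {"asuna": 0.4}, {"asuna": 0.7}]
labels = ["asuna", "asuna", "asuna", "asahina_mikuru"]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall(scores, labels, "asuna", threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision: fewer images are tagged, but the ones that are tagged are more likely to be right.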
The provided matrix is very helpful for seeing why our model works or fails. For example, it shows that asuna is mispredicted as asahina_mikuru twice. From a human perspective, this is understandable, since both characters have similar hair and eye color.
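If you export predictions from a service that doesn't show such a matrix, a similar view is easy to build yourself. The sketch below counts (true label, predicted label) pairs and lists the most frequent mistakes. The label lists are invented for illustration and the helper names are hypothetical; only the asuna/asahina_mikuru confusion mirrors what the Clarifai matrix showed.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def most_confused(counts, n=3):
    """Return the n most frequent off-diagonal (true, predicted) pairs."""
    errors = {pair: c for pair, c in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Invented evaluation data, for illustration only.
true_labels = ["asuna", "asuna", "asuna", "asahina_mikuru", "asahina_mikuru"]
predicted_labels = ["asuna", "asahina_mikuru", "asahina_mikuru", "asahina_mikuru", "asuna"]

counts = confusion_counts(true_labels, predicted_labels)
for (true, predicted), count in most_confused(counts):
    print(f"{true} mispredicted as {predicted}: {count}x")
```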
To get a quick grasp of how each of those services behaves, let's compare them on the same test image. For readers who don't know this character, the proper tag is "aragaki_ayase".
Azure Custom Vision gave back a wrong result. The correct tag is ranked 5th with a probability of 5.7%. Nevertheless, it might be unfair to compare this result with the others, since it comes from an open-domain model (the probabilities across all tags don't sum to 100%, so there is some probability that the image matches none of the tags).
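One common way this kind of difference arises is scoring each tag independently (a sigmoid per tag) versus normalizing across all tags (a softmax). Whether Azure Custom Vision or Clarifai actually work this way is an assumption on my part; the sketch below, with made-up raw scores, only illustrates why totals can differ while the ranking stays the same.

```python
import math


def sigmoid_scores(logits):
    """Score each tag independently; the total is not constrained to 1."""
    return {tag: 1 / (1 + math.exp(-z)) for tag, z in logits.items()}


def softmax_scores(logits):
    """Normalize across tags so that the scores sum to 1 (closed set)."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(z - m) for tag, z in logits.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}


# Made-up raw scores (logits) for three tags.
logits = {"aragaki_ayase": -2.8, "hyoudou_michiru": -1.0, "miki_sayaka": -3.5}

open_scores = sigmoid_scores(logits)
closed_scores = softmax_scores(logits)

print("independent total:", round(sum(open_scores.values()), 3))
print("normalized total:", round(sum(closed_scores.values()), 3))
```

With independent scores, every tag can be unlikely at the same time, which corresponds to the "none of the above" outcome that a closed-set model cannot express.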
Clarifai gave a correct result (aragaki_ayase) with a probability of 0.59 (out of 1). In addition, it provided us with some general tags such as illustration (0.97), woman (0.96), fun (0.94), fashion (0.94), young (0.93), and face (0.90).
For this test, I simply fed the test image to the existing MoeFlow model (trained with 100 categories). It still gave a correct prediction: aragaki_ayase (0.55), hyoudou_michiru (0.22), and miki_sayaka (0.08).
Azure Custom Vision is still in preview, so it will be exciting to see whether it can catch up with the current market contenders. UI-wise, Azure Custom Vision is simple and easy to use, but at the same time it lacks flexibility for more advanced users. Clarifai, for example, allows users to specify whether the concepts are mutually exclusive and whether the problem is in a closed environment. Its tables and graphs are also very helpful for understanding the bottlenecks of the current model.
Clarifai's model performance is undeniably good. Its top-1 accuracy for 10 categories is around 95.2%, ahead of the current MoeFlow model (88.6%) and Azure Custom Vision (77.4%). Please note that this experiment only compares top-1 accuracy in a mutually exclusive environment. Based on another experiment here, while Clarifai gave very good labeling results, you might want to consider other services if you have different requirements for your project.
Since the model that I'm using for MoeFlow is based on Google Inception v3 (transfer learning), I expect Google Cloud AutoML to perform similarly to or even better than 88.6%. From Google's blog post, Cloud AutoML looks very promising. It will be very interesting to see whether it can compete with existing services such as Clarifai, so stay tuned for the next update!