[3D Pose Estimation AI #1] Introduction

Upload Date: 16.12.2018
Blog Post Author: Kilian Mehringer
Job Description: Software Engineer
Co-Authors: Sebastian Lack

[3D Pose Estimation AI #1] Introduction to a deep-learning approach using RGB images only.

The Idea:

In this project we want to create a deep-learning approach that produces a first 3D pose estimate of an arbitrary object from a single RGB image. The object whose 3D pose we want to estimate is given as an OBJ file with textures.

What we want as the final result is a 3D pose in camera coordinates, so we can render a bounding box, or the object itself, on top of the image and replace the original image this way. As a first goal we only want to process a single frame, not a video stream, and we are completely happy with an execution time under 10 seconds. We don't want to process images in real time (> 30 fps).
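To make this goal concrete, here is a minimal sketch of the geometry behind "render a bounding box on top of the image": given a pose in camera coordinates (rotation R, translation t) and camera intrinsics K, the eight corners of the object's bounding box are projected into pixel coordinates with a standard pinhole model. All the numbers below (intrinsics, pose, box size) are made-up example values, not from our pipeline:

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project 3D points (object coordinates) into pixel coordinates
    with a pinhole camera model: x = K * (R * X + t)."""
    cam = points_3d @ R.T + t      # object -> camera coordinates
    uv = cam @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

# The 8 corners of a bounding box around the object,
# here a unit cube centred at the origin (example values).
corners = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])

# Example intrinsics for a full-HD image, and an example pose
# placing the object 3 m in front of the camera.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                  # no rotation, for illustration
t = np.array([0.0, 0.0, 3.0])  # 3 m along the optical axis

pixels = project_points(corners, K, R, t)
```

Drawing lines between these eight projected corners is all it takes to overlay the estimated box on the input image.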

Why the hell do we want to make something like this? The applications for software like this are mainly in robotic vision, where a robot can grasp objects using the calculated pose, and in mixed/augmented reality applications that need to know where objects are in the scene in order to replace them with virtual ones.

These are great applications and very interesting, but they all need some sort of real-time approach, and we don't know if we will even get close to real-time execution. Our goal is therefore to learn more about deep-learning methods and how to create them. Deep learning and the whole AI thing are extremely popular at the moment, but for me the most important and most interesting part is not this magical AI programming stuff; once you know a bit more about it, the magic fades away quickly. The most interesting question in this project, at least for me, is: can we generate the whole dataset synthetically in Blender3D with products and objects I modelled, textured and shaded over the last years? And if we can pull this off and generate such photo-realistic datasets, how do they perform? Do we have a chance to outperform existing models that were trained on synthetic but non-photo-realistic images, or is the effort of generating photo-realistic images wasted time for machine-learning datasets?
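To give an idea of what "generating the dataset synthetically" involves: every rendered training image needs a known camera pose as its label. A minimal sketch of how such poses could be sampled, a random viewpoint on a sphere around the object plus a look-at rotation towards it. This is pure NumPy with a made-up function name and Blender-style camera conventions; an actual rendering script would feed these values into Blender's Python API:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_camera_pose(radius=3.0):
    """Sample a camera position uniformly on a sphere around the object
    (placed at the origin) and build a rotation that makes the camera
    look at it. One such pose would label one rendered training image."""
    # Uniform random direction on the unit sphere.
    v = rng.normal(size=3)
    position = radius * v / np.linalg.norm(v)

    # Look-at rotation: the camera's z-axis points from the object
    # towards the camera (Blender cameras look down their -z axis).
    z = position / np.linalg.norm(position)
    x = np.cross(np.array([0.0, 0.0, 1.0]), z)
    if np.linalg.norm(x) < 1e-8:       # camera directly above/below
        x = np.array([1.0, 0.0, 0.0])
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z], axis=1)    # columns: camera axes in world frame
    return R, position

R, t = random_camera_pose()
```

Randomising lighting, backgrounds and materials on top of the pose is what would push such renders towards photo-realism.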

Existing Work:

As I briefly mentioned before, there are some very intelligent people out there who have already done this, with stunning results. The papers I know on this topic are all motivated by real-time execution. The first is a paper by Wadim Kehl, a computer vision and machine learning researcher at Toyota. To find out more details, his website is here

In short, they researched a way to train a deep neural network to produce a first estimate, something like a guess, of the 3D pose of an object in a given image. They were able to achieve robust and fast results that can be considered real time, if you want to call 10 fps real time.

Wadim Kehl was also involved in another paper, this time on pose refinement. So in addition to the first estimate, there are solutions out there that make this first guess more accurate through multiple iterations. This is a deep-learning approach as well, and it can be trained on the same synthetic data as the estimation model. So if we have the time and chance to try this, we also want to train a model for pose refinement after our pose estimation model. For our approach we will use mostly the same ideas and techniques; we only want to focus more on accuracy than on processing time. My focus is also on feeding the network images that are as photo-realistic as possible.

The other work we took a look at is a non-learning approach. It is the work of our computer vision professor Ulrich Schwanecke, who is supervising this project. He is the head of the CVMR group at the HSRM. See more

They chose a Gauss-Newton approach that achieves stunning real-time performance. Sadly, the paper is extremely math-heavy, so it is really hard to grasp what is happening there, but luckily for us we can ask questions in person. The preprint of the paper can be downloaded here.

With their algorithm it is possible to track objects at 50–100 Hz, which can easily be considered real time. Like every other technique, this one also needs a first pose estimate, which takes much more time than tracking an already estimated pose through multiple frames. But once an object is detected and a first pose is calculated, the tracking can be done extremely quickly.

So, as a conclusion: deep learning is not the best approach for 3D pose estimation. There are solutions out there that can outperform trained models fairly easily. But for us it is the perfect way to see whether we can get a working model with usable accuracy and processing time using only synthetically generated images.

DeepLearning, 3D Pose Estimation, ML, MachineLearning, Tensorflow, Python, Blender3D, Computer Science, AI, Artificial Intelligence, Robotic Vision