[3D Pose estimation AI #2] Pipeline
|Blog Post Author|Kilian Mehringer|
|Job Description|Software Engineer|
|Co-Authors|Sebastian Lack|
[3D Pose Estimation AI #2] Theoretical basics and pipeline design
We want to create a deep learning model for 3D pose estimation. In the end it should be possible to take a 3D model as an OBJ file with textures, plus an image of the object in an unknown pose, as input, and get a first guess for the pose parameters of the given object in the image.
To get there, we have to work through some theoretical problems first:
- What is a 3D pose, and how do we describe it mathematically?
- What is our coordinate system if we only have one image to estimate a pose from?
- What could our pipeline look like to get a first 3D pose estimation from our deep neural net?
How to define a Pose:
"If mathematics bores you pretty fast, here is the short version: we will use one vector and one quaternion to describe our pose."
The first problem when it comes to tracking or estimating a 3D pose from images is the 3D pose itself. At first you might think: "No problem, you just define three angles and a location for the pose, save them, and train the model to reproduce them."
That's roughly what we try to do, but there are some huge problems. Defining an object's rotation by plain angles, as Euler angles do, is a bad idea for two reasons. First, they produce something called gimbal lock, which means there are situations in which rotations around one axis get completely lost. Second, Euler angles can describe the same rotation with different combinations of angle values. This makes them unusable for training: it is very unlikely to get useful results by backpropagating a neural network with labels that describe the same rotation while being numerically completely different from each other.
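To see that ambiguity concretely, here is a small NumPy sketch of our own (not code from the project), using the common ZYX Euler convention. At a pitch of 90 degrees the gimbal locks and only the difference between roll and yaw matters, so two very different label triples describe the exact same rotation:

```python
import numpy as np

def Rx(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# ZYX Euler rotation: R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
# At pitch = 90 degrees only (roll - yaw) survives, so two very
# different label triples produce the identical rotation matrix.
R1 = Rz(0.3) @ Ry(np.pi / 2) @ Rx(0.5)   # yaw=0.3, roll=0.5
R2 = Rz(0.8) @ Ry(np.pi / 2) @ Rx(1.0)   # yaw=0.8, roll=1.0

print(np.allclose(R1, R2))  # True -> same rotation, different labels
```

A network trained against such labels would be punished for predicting a rotation that is actually correct.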
So we had to describe our rotations another way, and there were two candidates to choose from. The first was to use good old quaternions. They are awesome in computer graphics, so why shouldn't they be just as great in computer vision?
The second was to define a rotation by screws, which has some advantages, but hardly anyone uses them, and the small amount of documentation in languages we can read gave us no way to understand them and use them in our project.
This made the choice really easy for us: "quaternions it is".
Quaternions can be thought of as an extension of complex numbers. Where a complex number has one real and one imaginary part and can describe a rotation in 2 dimensions, a quaternion has a real part and 3 imaginary parts, lives in 4 dimensions, and gives an unambiguous description of rotations in 3D space.
A quaternion rotation can be described like this:
q = s + xi + yj + zk
If you want to learn more about quaternions and how they work, take a look at these great videos.
In short, the "s" value of the quaternion encodes the angle of the rotation, and the imaginary part (x, y, z) describes a normalized 3-dimensional vector: the rotation axis of the quaternion. Because quaternions are projections from a higher dimension, they sometimes behave strangely, and it is very hard to tell what a quaternion will do just by looking at the values of s, x, y and z.
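The "angle plus axis" reading can be made concrete with a small NumPy sketch (again our own illustration, not project code): we build a quaternion from an axis and an angle and rotate a point via the classic q·p·q* sandwich product.

```python
import numpy as np

def quat_from_axis_angle(axis, angle):
    """q = (s, x, y, z): s encodes the angle, (x, y, z) the rotation axis."""
    axis = axis / np.linalg.norm(axis)
    return np.concatenate(([np.cos(angle / 2)], np.sin(angle / 2) * axis))

def quat_mul(a, b):
    """Hamilton product of two quaternions in (s, x, y, z) order."""
    s1, v1 = a[0], a[1:]
    s2, v2 = b[0], b[1:]
    return np.concatenate(([s1 * s2 - v1 @ v2],
                           s1 * v2 + s2 * v1 + np.cross(v1, v2)))

def rotate(q, p):
    """Rotate point p by unit quaternion q via q * p * q_conjugate."""
    q_conj = q * np.array([1, -1, -1, -1])
    return quat_mul(quat_mul(q, np.concatenate(([0.0], p))), q_conj)[1:]

# A 90 degree rotation around the z axis maps (1, 0, 0) to (0, 1, 0).
q = quat_from_axis_angle(np.array([0.0, 0.0, 1.0]), np.pi / 2)
print(rotate(q, np.array([1.0, 0.0, 0.0])))
```

Note that the angle enters as angle/2: that halving is one of the "projection from a higher dimension" quirks mentioned above.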
Our final pose is described by two vectors: one is the location in 3D space, the other the rotation as a quaternion. We mark the location vector with "l" and the quaternion with "q" and separate them with ";", so we can extract and split them easily from a string later on.
Our final pose definition looks like this:
l[x, y, z];q[s, x, y, z]
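A minimal sketch of how such a pose string could be written and parsed back (the helper names are our own, and this is only one possible way to handle the format):

```python
def format_pose(location, quaternion):
    """Serialize a pose as 'l[x, y, z];q[s, x, y, z]'."""
    l = ", ".join(str(v) for v in location)
    q = ", ".join(str(v) for v in quaternion)
    return f"l[{l}];q[{q}]"

def parse_pose(pose_str):
    """Split the string at ';' and strip the 'l[' / 'q[' wrappers."""
    l_part, q_part = pose_str.split(";")
    loc = [float(v) for v in l_part[2:-1].split(",")]
    quat = [float(v) for v in q_part[2:-1].split(",")]
    return loc, quat

s = format_pose([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0])
print(s)  # l[1.0, 2.0, 3.0];q[1.0, 0.0, 0.0, 0.0]
loc, quat = parse_pose(s)
```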
The Camera coordinate system
The next big thing we have to think about is coordinate systems. In computer graphics we deal with two coordinate systems most of the time. The first one is the "world coordinate system". It's the one we define our whole scene in. For computer vision, we can think of it as the 3-dimensional coordinate system we live in.
If we want to know how an object is oriented in this world coordinate system, we need multiple images and the position of the camera in world coordinates. This is more or less what photogrammetry software tries to achieve. We only have one image, or, if we work on video streams, a series of images, without any information about the position of the camera in our world. What we can get from such an image is the position of the object relative to our camera.
In computer graphics it is part of the standard pipeline to calculate rotations and transformations in the world coordinate system. After that, the object is transformed into the camera coordinate system by the so-called modelview matrix: if we multiply each point of our model in 3D space with this matrix, we get that point's coordinates in the camera coordinate system.
This is very important to understand, because for our project we can only work in this camera-based coordinate system. Whenever we calculate, rotate or transform something, we need to remind ourselves that we do it in camera coordinates only. And for our training data we always need to save our poses in camera coordinates; otherwise it would be impossible to get any useful results from it.
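To make the world-to-camera step concrete, here is a minimal NumPy sketch. The modelview matrix values are made up for illustration: a 90 degree rotation around y, followed by a translation 5 units along the camera's -z axis (OpenGL cameras look down -z).

```python
import numpy as np

# Hypothetical modelview matrix: rotate 90 degrees around y, then
# translate 5 units along the camera's -z axis (illustrative values).
modelview = np.array([
    [ 0.0, 0.0, 1.0,  0.0],
    [ 0.0, 1.0, 0.0,  0.0],
    [-1.0, 0.0, 0.0, -5.0],
    [ 0.0, 0.0, 0.0,  1.0],
])

def world_to_camera(point, mv):
    """Multiply a homogeneous world point with the modelview matrix."""
    p = np.append(point, 1.0)   # (x, y, z) -> (x, y, z, 1)
    return (mv @ p)[:3]         # back to 3D camera coordinates

# A point 1 unit along world x ends up 6 units in front of this camera.
print(world_to_camera(np.array([1.0, 0.0, 0.0]), modelview))
```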
Now we have some of the basics done. There will be more of this later on, but nothing we have to worry about too much right now. The idea of a 3D pose as one vector plus a quaternion, together with a rough knowledge of camera and world coordinates in computer graphics, is enough to design our 3D pose estimation pipeline.
Just like a render pipeline defines what to do, and in which order, to render images from simple numerical data onto our screens, we want a pipeline for our 3D pose estimation problem: a pipeline that defines what to do in which order. Something like a very rough architecture for our pose estimation software.
"At this point it's just an idea of what this pipeline could look like. We might have to redesign parts of it, or the whole thing, later on in the process."
In our pipeline we will start with an input image. This image contains the object whose pose we want to find and estimate. The first thing we do is render the object at different rotations from the given OBJ file. These renders will be the shapes we want to find in the input image. Because we render these images via OpenGL, we can calculate the rotation quaternion for every image.
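One way to label those template renders is to sample a unit quaternion per render pass. A sketch under our own assumptions (the post does not say how the rotations are chosen; here we use Shoemake's method for uniform random rotations):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_quaternion(rng):
    """Uniformly sampled unit quaternion (Shoemake's subgroup method)."""
    u1, u2, u3 = rng.random(3)
    a, b = np.sqrt(1 - u1), np.sqrt(u1)
    return np.array([a * np.sin(2 * np.pi * u2), a * np.cos(2 * np.pi * u2),
                     b * np.sin(2 * np.pi * u3), b * np.cos(2 * np.pi * u3)])

# One label quaternion per template render pass (count is arbitrary here).
template_rotations = [random_quaternion(rng) for _ in range(32)]
```

A regular grid over viewing angles would work just as well; the important part is that every template carries its exact rotation as a label.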
Now we have our input image and a bunch of labeled template images. We can feed the input image and one of the rendered template images into our trained model. The model searches for the template in the input image. The output is a bounding box locating the template in the input image, as well as a number describing how well the template matches the object inside the estimated bounding box.
We repeat this for every template image. After that we have something like a list of predictions, one per template image. To get the best bounding box, we just search for the highest matching score in this list. This gives us the bounding box and the template image that matches the given input image best. We can then look up the precomputed rotation quaternion for this particular template image to get the rotational part of our pose.
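The selection step itself is just an argmax over the match scores. A tiny sketch with made-up prediction tuples (the field layout is our own; the post only describes the idea):

```python
# Each prediction: (bounding_box, match_score, template_quaternion).
# All values below are invented for illustration.
predictions = [
    (( 40,  60, 32, 32), 0.41, (0.924, 0.0, 0.383, 0.0)),
    ((120,  80, 30, 31), 0.87, (0.707, 0.0, 0.707, 0.0)),
    ((118,  82, 29, 30), 0.64, (0.383, 0.0, 0.924, 0.0)),
]

# Pick the prediction with the highest match score.
best_box, best_score, rotation = max(predictions, key=lambda p: p[1])
print(best_box, rotation)  # (120, 80, 30, 31) (0.707, 0.0, 0.707, 0.0)
```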
To get the localization part, we have to calculate it from the position of the bounding box. This looks easy at first, because x and y are already given by the center of the bounding box, and z can be estimated from the scale of the bounding box. But we need to take the projection matrix into account in this calculation, and at this point we don't know how to deal with the projection matrix in this setup.
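Ignoring the projection matrix for a moment, a naive pinhole-style estimate could look like the sketch below. Every parameter here (focal length, principal point, reference size and depth) is hypothetical, and this deliberately sidesteps the open projection-matrix question:

```python
import numpy as np

def localize_from_bbox(bbox, focal_px, principal_point, ref_size_px, ref_depth):
    """Naive pinhole back-projection from a bounding box.

    bbox: (cx, cy, w, h) in pixels. ref_size_px is the bbox width the
    template had at the known reference depth ref_depth. All parameters
    are hypothetical; the projection matrix is not handled here.
    """
    cx, cy, w, h = bbox
    z = ref_depth * ref_size_px / w            # apparent size shrinks with depth
    x = (cx - principal_point[0]) * z / focal_px
    y = (cy - principal_point[1]) * z / focal_px
    return np.array([x, y, z])

loc = localize_from_bbox((400, 300, 50, 50), focal_px=800,
                         principal_point=(320, 240),
                         ref_size_px=100, ref_depth=1.0)
print(loc)  # depth 2.0: the template appears at half its reference size
```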
There are other pipeline ideas that might be more efficient or easier to realize, but this is the first one we want to try. If we can't get any reliable results, we will try the other architectures we have in mind.
Machine Learning, ML, Deep Learning, TensorFlow, 6DOF, Computer Vision, Computer Science, Python, OpenCV, OpenGL