Training a performant object detection ML model on synthetic data using Unity Perception tools

Supervised machine learning (ML) has revolutionized artificial intelligence and has led to the creation of numerous innovative products. However, with supervised machine learning, there is always a need for larger and more complex datasets, and collecting these datasets is costly. How can you be sure of the label quality? How do you ensure that the data is representative of production data? An exciting new solution to this problem, particularly for object detection tasks, is to generate a massive synthetic dataset. Synthetic data alleviates the challenge of acquiring the large labeled datasets needed to train machine learning models.

This blog post is the third in our series on generating synthetic data with Unity. In the first blog post, we discussed the challenges of gathering a large volume of labeled images to train machine learning models for computer vision tasks. More recently, we showed you how to generate labeled data frames using Unity's Perception tools.

Now, we will show you how to:

- Generate a large dataset of desired objects in novel environments using Unity Simulation
- Train an object detection model (Faster R-CNN) using a synthetic dataset
- Fine-tune this model on a small number of real-world examples

The result is a model that performs well on a new real-world dataset we are releasing with this post, and that outperforms a model trained using only real data. We also provide pipelines and instructions in datasetinsights for creating an environment, generating data, and training a model with your own customized assets and data.
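As a rough preview of the last two steps, the sketch below shows one way to fine-tune a detector that has already been trained on synthetic data, using standard PyTorch/torchvision conventions. The learning rate, epoch count, and data-loading details are illustrative assumptions, not the exact recipe used in this work.

```python
import torch

def fine_tune(model, real_loader, epochs=10, lr=1e-4):
    """Fine-tune a detector that was pre-trained on synthetic data.

    `model` is a torchvision-style detection model (e.g., Faster R-CNN) whose
    weights were already fit to the synthetic dataset; `real_loader` yields
    (images, targets) batches in the torchvision detection format, where each
    target dict contains "boxes" and "labels".
    """
    # A small learning rate keeps the features learned from synthetic data intact.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in real_loader:
            # In training mode, torchvision detection models return a dict of losses.
            loss_dict = model(list(images), list(targets))
            loss = sum(loss_dict.values())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```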

Generating a synthetic dataset at scale using Unity Simulation

We have chosen to use a Faster R-CNN to detect 63 everyday grocery objects. Training this model requires creating 3D assets of the objects of interest, automatically composing scenes, rendering image data, and generating the bounding box labels.
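For context, instantiating a Faster R-CNN with a class count matching this setup is straightforward in torchvision. This is a generic sketch, not necessarily the exact architecture or configuration used in the experiments:

```python
import torch
import torchvision

# 63 grocery classes plus the background class that torchvision's
# detection models reserve at index 0.
NUM_CLASSES = 63 + 1

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES)

# Quick smoke test with a random image: in eval mode, each prediction is a
# dict with "boxes", "labels", and "scores".
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])
```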

For this project, we created 3D scans of all 63 objects and used the Unity Perception package to generate labeled data automatically. As described in a previous blog post, we controlled the placement and orientation of the target objects, along with the arrangement, shape, and texture of the background objects, for each render. Additionally, we randomized the lighting, object hue, blur, and noise for every rendered image. Using the Perception package, we captured RGB images along with JSON files containing the corresponding bounding boxes.
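To give a concrete sense of how such output can be consumed, here is a minimal sketch that reads one captures JSON file and extracts image filenames with their bounding boxes. The field names ("captures", "annotations", "values", and so on) follow the layout commonly written by the Perception package, but the schema has varied across package versions, so treat them as assumptions and verify against your own files (or use the datasetinsights package, which handles this parsing for you).

```python
import json
from pathlib import Path

def load_bounding_boxes(captures_file):
    """Return a list of (image_filename, boxes) pairs from one captures JSON file.

    Each box is a (label_id, x, y, width, height) tuple in pixel coordinates.
    Field names are assumptions based on the Perception package's bounding-box
    output and may need adjusting for your package version.
    """
    data = json.loads(Path(captures_file).read_text())
    samples = []
    for capture in data.get("captures", []):
        boxes = []
        for annotation in capture.get("annotations", []):
            for value in annotation.get("values", []) or []:
                boxes.append((
                    value["label_id"],
                    value["x"], value["y"],
                    value["width"], value["height"],
                ))
        samples.append((capture["filename"], boxes))
    return samples
```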

Figure 1: Examples of synthetic data with bounding boxes and the associated metadata JSON.

To create our environment, we used two types of assets: foreground assets and background assets. Foreground assets are scans of the objects we want to detect. Background assets, in contrast, either make up the scene's background or occlude the target objects (acting as distractors).

Crafting these assets posed a unique set of challenges. First, for the background and distractor objects, we needed to expose and vary their textures and hues. Second, the foreground assets had to be realistic, so scanning the foreground objects required extra attention and some post-scan touch-up.

Creating a real-world dataset

To create the real-world dataset, we purchased the products, arranged them in several different locations in our Bellevue office, and took pictures. To ensure that the dataset was diverse, we placed the objects under varied lighting and background conditions, and we varied the set of objects, as well as their locations, orientations, and configurations, in every shot.

To annotate these images,

