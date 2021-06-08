3. Quick prototyping with a small dataset

Data augmentation provided by TAO Toolkit and pretrained models from the NGC catalog can be used together for quick prototyping with as few as 100 images.

3.1 Challenges with data collection for training AI models

The old adage "garbage in, garbage out" holds true in the world of AI. You need large, high-quality datasets when training models from scratch. There are several challenges with data collection:

- Collecting and labeling data is time-consuming, labor-intensive, and expensive.
- Small datasets can lead to poorly performing models because they lack the variety needed to train a robust model.
- In some cases, data may simply be unavailable or restricted (for example, patient medical X-rays and scans).

Take, for instance, defect detection on a PCB assembly line, a critical task for ensuring quality and discovering errors in the assembly process. The rate at which defects occur in PCB assembly is so low that it can take months or years to collect enough images to train an accurate model. The challenge of collecting enough data spans many industries when trying to detect anomalies. There are a few ways to overcome these challenges:

- Synthetic data generation: Synthetic data is annotated information that computer simulations or algorithms generate.
- Data augmentation: Augmenting your dataset adds variability and randomness that help the model generalize, which improves accuracy on data that the model has never seen before.

Both methods are significantly cheaper and faster than collecting more data. For this experiment, we'll look at the data augmentation feature in the TAO Toolkit.

3.2 What is data augmentation?

Data augmentation takes an existing dataset and applies transformations in the spatial and color domains to create new images that are similar to the originals but different enough to add variability and help the model generalize. Much research has been done to determine the most effective augmentation techniques. Common transformations include translation, rotation, and color shifting.

When a model trains on a small dataset, it begins to memorize the patterns in the data rather than learn the features needed to solve the problem. Increasing the size of the dataset through augmentation increases the complexity of the data and forces the model to generalize rather than memorize. This reduces overfitting on the training set and improves performance on images that the model hasn’t seen before. Augmentation is especially useful when the model may come across objects in variable lighting conditions, positions, and orientations.

Augmentation is applied to a dataset either offline or online. Offline augmentation is applied before training and creates new images in storage with the transformations applied. This gives you control over the number of unique images that the model trains on and typically leads to the model converging in fewer epochs.

Online augmentation dynamically applies randomized transformations to each image as it is used in training. No extra images are stored, so no extra disk space is required. Because each applied transformation creates a unique image, the model trains on new images continuously. For the same reason, a model trained with online augmentation may take more epochs to converge.
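To make the transformations concrete, here is a minimal, dependency-free sketch of one color and two spatial augmentations, applied with random parameters the way online augmentation would apply them. It is an illustration only and not how TAO Toolkit implements augmentation: the function names (`horizontal_flip`, `translate_x`, `shift_brightness`, `random_augment`), the parameter ranges, and the representation of an image as a nested list of grayscale pixel values are all assumptions made for this example.

```python
import random


def horizontal_flip(image):
    """Spatial: mirror each row left to right."""
    return [row[::-1] for row in image]


def translate_x(image, dx, fill=0):
    """Spatial: shift pixels dx columns to the right (left if dx is negative).

    Columns that are vacated by the shift are filled with `fill`.
    """
    width = len(image[0])
    shifted = []
    for row in image:
        if dx >= 0:
            new_row = [fill] * min(dx, width) + row[:max(width - dx, 0)]
        else:
            new_row = row[-dx:] + [fill] * min(-dx, width)
        shifted.append(new_row)
    return shifted


def shift_brightness(image, delta):
    """Color: add delta to every pixel, clipped to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def random_augment(image, rng):
    """Apply randomly drawn transformations, as online augmentation does."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    image = translate_x(image, rng.randint(-1, 1))
    return shift_brightness(image, rng.randint(-30, 30))


# Tiny hypothetical 2x3 grayscale image, used only for illustration.
image = [[10, 20, 30],
         [40, 50, 60]]
rng = random.Random(0)
augmented = random_augment(image, rng)
```

Each call to `random_augment` draws new parameters, so repeated calls on the same input can produce different images. Saving those outputs before training would correspond to offline augmentation; calling the function inside the training loop corresponds to online augmentation.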

3.3 Applying data augmentation with the TAO Toolkit

The TAO Toolkit supports both online and offline data augmentation. You perform offline augmentation by configuring a spec file and using the command-line interface to generate the images. The configuration lets you customize spatial, color, and blurring augmentations. You can also customize online augmentation by specifying the range of spatial and color augmentations to be applied while the model is training.

The recommended way to use augmentation in the TAO Toolkit is to first apply offline augmentation to increase the size of the dataset. Then, configure training to use online augmentation to further increase the complexity of the dataset. Combining both types of augmentation lets the model see a large variety of images and leads to better model performance, as we show in the augmentation task results.

Figure 5. Original image of PCB

Figure 6. Augmented images of PCB

The following table shows the augmentation techniques that TAO Toolkit supports.

Spatial: Rotation, Flip, Translation, Shear, Zoom
Color: Hue Rotation, Saturation Shift, Contrast, Brightness, Color Shift
Blur

Table 2. Spatial and color augmentations available in TAO Toolkit
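The recommended two-stage flow described in this section (offline augmentation first, then online augmentation during training) can be sketched generically as follows. This is not the TAO Toolkit CLI or spec-file format, and it does not reproduce the experiment's results. The helpers `rotate_90`, `offline_augment`, and `online_epoch`, the choice of transformations, the 4x4 random images, and the three copies per image are all hypothetical choices for illustration; only the starting size of 100 images echoes the text.

```python
import random


def rotate_90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def offline_augment(images, copies_per_image, rng):
    """Stage 1: create transformed copies once, before training.

    In practice these copies would be written to storage; here they are
    kept in a list alongside the originals.
    """
    expanded = list(images)
    for image in images:
        for _copy in range(copies_per_image):
            rotated = image
            quarter_turns = rng.randint(1, 3)  # 90, 180, or 270 degrees
            for _turn in range(quarter_turns):
                rotated = rotate_90(rotated)
            expanded.append(rotated)
    return expanded


def online_epoch(images, rng):
    """Stage 2: yield a freshly perturbed version of every image.

    Nothing is stored; each epoch draws new brightness offsets.
    """
    order = list(range(len(images)))
    rng.shuffle(order)
    for index in order:
        delta = rng.randint(-20, 20)
        yield [[min(255, max(0, p + delta)) for p in row]
               for row in images[index]]


rng = random.Random(0)
# 100 small random grayscale images stand in for a real dataset.
originals = [[[rng.randint(0, 255) for _ in range(4)] for _ in range(4)]
             for _ in range(100)]

stored = offline_augment(originals, copies_per_image=3, rng=rng)  # 100 -> 400
epochs = [list(online_epoch(stored, rng)) for _ in range(2)]
```

In this sketch the offline stage fixes the number of stored images (400), while the online stage leaves that count unchanged and instead varies what the model would see from one epoch to the next.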