Train

Train your AI to perform the cognitive task of your choice.

Overview

Train is the fourth section of the platform. Each training session combines a version of the neural network algorithm taken from the Library, your Dataset, and the Machine that will perform the training. Each training session contains a set of experiments, and each experiment is one attempt to train your AI on your dataset.

Create a new training session

Go to the Train section of Theos and click the New training button. Enter a name and click Confirm to create your new training session.

Configure your training session

Choose the algorithm

Currently, Theos supports all versions of the state-of-the-art YOLOv5 object detector. The extra large version is the most accurate, but it also demands more computational power, so it takes longer to train and longer to make each prediction once deployed. For blazing-fast, real-time speeds, choose a smaller version.

Choose the dataset

Select the dataset you want your AI to learn from.

Choose the machine

If you have one of our professional plans, choose one of your always-connected, ready-to-use Theos Cloud Machines, which come with powerful NVIDIA GPUs for lightning-fast training. Otherwise, click the + button inside the add machine card to connect your own on-premise GPU machine, or click the Use google colab button to use one of Google Colab's free GPUs.

Set your experiment's training configuration

An Epoch is one pass of your AI through the entire dataset, in which it attempts to predict all the labels you created during the labeling process. The first time this happens, your AI will likely fail to correctly predict the class, position and dimensions of almost all your labels. This is why we must let our AI make many attempts, so that it learns from its mistakes. It is common practice to set a few hundred epochs per experiment. In most cases, 300 epochs is fine for an initial training run.

The Batch size is the number of images your AI will predict in parallel. The higher this number, the less time each epoch takes to complete, but the more GPU memory it requires. Theos Cloud Machines come with 16GB of GPU memory. If you are on the free plan, you may need to experiment with this value to avoid overloading your GPU memory. But don't worry, Theos will let you know if this happens and let you change the value so you can restart your training experiment.
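To make the relationship between dataset size, batch size and epochs concrete, here is a minimal sketch. It is not part of Theos; the function names and the 1,000-image dataset are hypothetical, and it only assumes that an epoch processes the images in batches, with the final batch allowed to be smaller.

```python
import math


def iterations_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches needed to pass over every image once."""
    if num_images <= 0 or batch_size <= 0:
        raise ValueError("num_images and batch_size must be positive")
    # The last batch may be partially filled, hence the ceiling.
    return math.ceil(num_images / batch_size)


def total_iterations(num_images: int, batch_size: int, epochs: int) -> int:
    """Total number of batches processed over a whole experiment."""
    return iterations_per_epoch(num_images, batch_size) * epochs


if __name__ == "__main__":
    # Hypothetical dataset of 1,000 images trained for 300 epochs.
    for batch_size in (8, 16, 32):
        per_epoch = iterations_per_epoch(1000, batch_size)
        total = total_iterations(1000, batch_size, 300)
        print(f"batch size {batch_size}: {per_epoch} batches/epoch, {total} in total")
```

Doubling the batch size roughly halves the number of batches per epoch, which is why larger batches shorten each epoch when the GPU has enough memory to hold them.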

At the end of each completed epoch, the machine will upload to Theos a checkpoint of your AI's current knowledge, in what is called a Weights file. The weights are the representation of the strengths of all the neural connections in the brain of your AI. For each experiment, Theos saves the Last epoch's weights as well as the weights generated in the epoch where the Best performance was achieved (because your AI may reach maximum accuracy at, for example, epoch 185, but start to degrade its performance later due to Overfitting). Later, when you decide to deploy your AI, you will have to choose which weights you want your AI to use.
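The difference between the Last and Best weights can be illustrated with a small sketch. This is not Theos code; the `Checkpoint` class, the `select_checkpoints` function and the fitness values are made up for illustration, and the only assumption is that one fitness value is recorded per completed epoch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Checkpoint:
    epoch: int      # 1-based epoch at which the weights were saved
    fitness: float  # fitness measured at the end of that epoch


def select_checkpoints(
    fitness_per_epoch: Sequence[float],
) -> Tuple[Optional[Checkpoint], Optional[Checkpoint]]:
    """Return (last, best) checkpoints for a list of per-epoch fitness values."""
    last: Optional[Checkpoint] = None
    best: Optional[Checkpoint] = None
    for epoch, fitness in enumerate(fitness_per_epoch, start=1):
        last = Checkpoint(epoch, fitness)
        # Keep the earliest epoch that reached the highest fitness so far.
        if best is None or fitness > best.fitness:
            best = last
    return last, best


if __name__ == "__main__":
    # Hypothetical run in which fitness peaks and then degrades (overfitting).
    history = [0.10, 0.35, 0.52, 0.61, 0.58, 0.55]
    last, best = select_checkpoints(history)
    print("Last:", last)
    print("Best:", best)
```

In this made-up run the Best weights come from an earlier epoch than the Last weights, which is the situation described above.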

Finally, you can also set Initial weights if you want your AI to start with the knowledge of a previously trained AI, instead of starting from scratch. This will make it achieve good accuracy in fewer epochs if the previous knowledge is sufficiently transferable to your current dataset.

Start training

Click the Start training button to make your AI learn from your dataset examples.

Wait for the training experiment to finish

Now you are free for a while. You can go grab a cup of coffee or watch a movie: your AI has started training and you just have to wait for it to finish.

Monitor training progress and metrics

If you want, you can check the training progress and metrics once in a while. New metric values stream directly to your browser once per minute of training, so you can watch your AI learn in near real time.

The main metric to watch is the fitness of your AI. This represents how good your AI is at predicting the class of your labels, as well as their position and dimensions. Its value goes from 0 to 1, and higher is better. Generally, a good enough object detector requires a fitness of 0.5 or above. Fitness is also the value used to determine which weights file is the Best one of the experiment. You can safely ignore most other metrics for now. We will talk about them in a future neural network debugging guide.
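This page does not say how Theos computes fitness. In the open-source YOLOv5 code, fitness is a weighted sum of precision, recall, mAP@0.5 and mAP@0.5:0.95 with weights 0.0, 0.0, 0.1 and 0.9. The sketch below assumes Theos follows that convention, which is an assumption and not something confirmed here; the function name and the metric values are illustrative only.

```python
# Weights for [precision, recall, mAP@0.5, mAP@0.5:0.95], as used by the
# open-source YOLOv5 fitness function. Assumed, not confirmed, for Theos.
FITNESS_WEIGHTS = (0.0, 0.0, 0.1, 0.9)


def fitness(precision: float, recall: float, map50: float, map50_95: float) -> float:
    """Weighted combination of detection metrics, each expected in [0, 1]."""
    metrics = (precision, recall, map50, map50_95)
    for value in metrics:
        if not 0.0 <= value <= 1.0:
            raise ValueError("metrics must be between 0 and 1")
    return sum(w * m for w, m in zip(FITNESS_WEIGHTS, metrics))


if __name__ == "__main__":
    # Hypothetical metric values read from a training dashboard.
    score = fitness(precision=0.80, recall=0.70, map50=0.75, map50_95=0.50)
    print(f"fitness = {score:.3f}")
    print("good enough" if score >= 0.5 else "keep training")
```

Because the weights sum to 1 and every metric lies between 0 and 1, the result also lies between 0 and 1, consistent with the range described above.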

Resume training

If you happen to be using a machine connected from Google Colab, your training may be interrupted because Google shuts Colab instances down after a few hours.

To resume your training, do the following:

1. Stop the current training by clicking the Stop training button in the bottom right corner.
2. Delete the previously connected Colab machine.
3. Connect a new machine.
4. Create a new experiment by clicking the + button in the top left.
5. Set the Last weights from the interrupted experiment as the initial weights.
6. Set the number of epochs to the number you had left to complete in the interrupted experiment.
7. Click the Start training button to resume your training.
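The number of epochs to set in the new experiment is simple arithmetic, sketched below. The function name is hypothetical, and it assumes you count only fully completed epochs of the interrupted experiment, since weights are uploaded at the end of each completed epoch.

```python
def remaining_epochs(planned_epochs: int, completed_epochs: int) -> int:
    """Epochs still to run after an interrupted experiment."""
    if planned_epochs <= 0:
        raise ValueError("planned_epochs must be positive")
    if not 0 <= completed_epochs <= planned_epochs:
        raise ValueError("completed_epochs must be between 0 and planned_epochs")
    return planned_epochs - completed_epochs


if __name__ == "__main__":
    # Hypothetical case: 300 epochs planned, interrupted after 185 completed.
    print(remaining_epochs(300, 185))
```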

Training has finished

Your AI has finished training. You can now review all the training metrics one more time before deploying your AI into production to test it and finally integrate it with your software.

