How many days did it take to train GPT-3? Is training a neural net model a parallelizable task?

I am trying to read the GPT-3 paper.

How long did it take to train the GPT-3 model? According to the preceding chart, GPT-3 required 3640 days of training, which amounts to about 9.97 years. Do I have this right?

If so, how could the model have been trained by a company that was founded only five years ago? Can the training of a neural net be parallelized across multiple GPUs to shorten the training time? My understanding is that training, i.e. weight optimization, cannot be parallelized, because every weight has to be adjusted gradually, step by step, on each back-propagation pass. The only way for a weight to reach its optimal value is to change it incrementally, one step after another, so the task cannot be parallelized. Do I have this right?
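To make my mental picture concrete, here is a toy sketch (just a made-up quadratic loss in NumPy, nothing to do with real GPT-3 code) of why the steps look inherently sequential to me:

```python
import numpy as np

# Plain gradient descent on a toy loss ||w||^2 with made-up numbers.
# Each step needs the weights produced by the previous step, so the
# steps themselves form a strictly sequential chain.
w = np.array([5.0, -3.0])
lr = 0.1
for step in range(50):
    grad = 2 * w          # gradient of ||w||^2 at the *current* weights
    w = w - lr * grad     # step t+1 cannot begin until this has finished
print(w)                  # converges toward [0, 0]
```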

What do the tokens in this table mean?


The exact training duration of GPT-3 has not been disclosed, but it involved a very large amount of computation. The 3640 figure in the paper is measured in petaflop/s-days of total training compute, not calendar days, so while 3640 days would indeed be about 9.97 years, that number does not translate directly into GPT-3's wall-clock training time.
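If the 3640 figure is read as petaflop/s-days of compute rather than calendar days, it can be turned into a rough wall-clock estimate. The cluster size and per-GPU throughput below are hypothetical numbers for illustration, not figures published by OpenAI:

```python
# Back-of-the-envelope conversion of the paper's compute figure.
PFLOPS = 1e15                 # FLOP/s in one petaflop/s
SECONDS_PER_DAY = 86_400

compute_pf_days = 3640        # petaflop/s-days of total training compute
total_flops = compute_pf_days * PFLOPS * SECONDS_PER_DAY
print(f"Total training compute: {total_flops:.2e} FLOPs")   # about 3.1e23

# Hypothetical cluster: 1024 GPUs, each sustaining ~30 TFLOP/s in practice.
gpus = 1024
sustained_flops_per_gpu = 30e12
cluster_flops = gpus * sustained_flops_per_gpu

wall_clock_days = total_flops / cluster_flops / SECONDS_PER_DAY
print(f"Wall-clock estimate on that cluster: ~{wall_clock_days:.0f} days")
```

With those made-up numbers the estimate comes out on the order of a few months, which is why reading "3640" as calendar days (nearly a decade) does not hold up.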


It may be that while applying the update to a specific set of weights is not parallelizable, the calculations leading up to that update are. A great many operations have to be carried out during a single pass from one layer of hidden units to the next.

Aside from the actual weight update, it seems to me (keeping in mind that I have not written any such code myself) that those operations could all be done at once. Once they are complete and the increments are ready to be applied to the relevant weights, the updates can then be performed in a single block.

I believe that would still result in the majority of the calculations being done in parallel.
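As a rough illustration of that idea (a toy linear model in NumPy standing in for real multi-GPU training), the per-shard gradients below could all be computed at the same time, and only the final averaged update has to be applied sequentially, as one block:

```python
import numpy as np

# Toy data-parallel training step for a linear model y ~ X @ w.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 16))
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=4096)

w = np.zeros(16)
lr = 0.1
n_workers = 4                 # stand-ins for 4 GPUs

for step in range(100):
    # 1) Split the batch across workers; in a real system each shard's
    #    gradient would be computed on a different GPU at the same time.
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [2 * Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]

    # 2) The only sequential part: average the shard gradients and apply
    #    the weight update once, as a single block (an all-reduce step in
    #    real frameworks).
    w -= lr * np.mean(grads, axis=0)
```

All of the expensive matrix work happens in step 1; step 2 is a cheap synchronized update, which matches the intuition that most of the computation can run in parallel even though the update itself is serialized.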