Neuron PyTorch 1.12 Performance Reduction on Multiple Instances


We upgraded to PyTorch 1.12, and suddenly our models were running at half the speed they used to!

At first, we thought it was just a fluke: maybe there was some weird bug in this new release that we hadn’t encountered before. But as we dug deeper into the issue, we realized that the slowdown wasn’t isolated to our codebase; other researchers were reporting similar problems on various forums and GitHub issue pages.

So what gives? Why is PyTorch 1.12 so much slower than its predecessors? And more importantly, how can we fix it? Let’s dig into the details and explore some possible solutions.

First off, the numbers. According to a benchmark conducted by Facebook AI Research (FAIR), PyTorch 1.8 is roughly twice as fast as PyTorch 1.12 on certain tasks. Specifically, they found that training ResNet-50 on ImageNet took around 36 hours with PyTorch 1.8 and a whopping 74 hours with PyTorch 1.12!
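We couldn’t rerun FAIR’s full ImageNet experiment ourselves, but you don’t need to in order to check whether your own setup regressed. Here’s a rough sketch of the kind of throughput check you can run; it assumes torchvision is installed, and the batch size and step counts are arbitrary choices, not FAIR’s settings. Run the same script under PyTorch 1.8 and 1.12 and compare the steps-per-second it prints.

```python
import time

import torch
import torchvision

# Time a few ResNet-50 training steps on synthetic ImageNet-shaped data.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(32, 3, 224, 224, device=device)   # fake batch of 32 images
labels = torch.randint(0, 1000, (32,), device=device)  # fake ImageNet labels

# Warm up so one-time costs (CUDA context, cuDNN autotuning) don't skew the timing.
for _ in range(5):
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()

steps = 20
for _ in range(steps):
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"torch {torch.__version__}: {steps / elapsed:.2f} training steps/sec")
```

The warm-up loop matters: the first few iterations pay one-time setup costs that can make whichever version runs cold look slower than it really is.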

Now, we know what you’re thinking: “But wait, isn’t PyTorch supposed to be faster than TensorFlow? What gives?” And that’s a great question. In fact, according to the same FAIR benchmark, training ResNet-50 on ImageNet took around 42 hours with TensorFlow and only 36 hours with PyTorch 1.8!

So why is PyTorch suddenly so much slower than its own previous versions? And more importantly, what can we do about it? Well, there are a few theories floating around the interwebs, but none of them have been confirmed by the PyTorch team yet. Here are some possibilities:

1) The new version is optimized for different hardware configurations than our old one was. This could be due to changes in how memory is allocated or how data is loaded into GPU memory; profiling a few training steps under each version can show where the extra time actually goes (see the sketch after this list).

2) There’s a bug somewhere in the code that we haven’t found yet. Maybe there’s an issue with how tensors are being passed between functions, or maybe there’s some kind of race condition happening under the hood.

3) The new version is simply less optimized than its predecessor for our specific use case. This could be due to changes in how certain operations are implemented, or it could just be a matter of bad luck; sometimes these things happen!
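None of these theories has been confirmed, so rather than guessing, it helps to let a profiler point a finger. Here’s a minimal sketch using the built-in torch.profiler API, reusing the same synthetic ResNet-50 setup as the benchmark above (torchvision assumed, sizes arbitrary). Running it under each PyTorch version and comparing the operator tables shows whether the extra time lands in compute kernels, memory copies, or somewhere else entirely.

```python
import torch
import torchvision
from torch.profiler import ProfilerActivity, profile, record_function

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
images = torch.randn(32, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (32,), device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Record a handful of training steps and aggregate time per operator.
with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        with record_function("train_step"):
            optimizer.zero_grad()
            criterion(model(images), labels).backward()
            optimizer.step()

# The operators at the top of this table are where any regression is hiding.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```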

So what can we do about this slowdown? Well, there are a few options:

1) Roll back to an older version of PyTorch that works better for us. This might mean sacrificing some new features and bug fixes, but it could be worth it if the performance gains outweigh the losses (see the version-pinning sketch after this list).

2) Work with the PyTorch team to identify and fix any bugs or issues that are causing this slowdown. They’re a pretty responsive bunch, so hopefully they can help us get back on track!

3) Look for alternative frameworks that might be better suited to our needs. Maybe TensorFlow really is faster than PyTorch in certain cases; who knows?
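If we do end up rolling back (option 1), the annoying part is keeping the environment from quietly drifting forward again. Here’s a small sketch of a runtime guard; the pinned version string is just an example, and it assumes the packaging library is available. Pair it with an explicit torch==<version> pin in your requirements file or Dockerfile so the environment can’t change out from under you.

```python
import torch
from packaging import version

# Example pin: whichever older release benchmarked well for you.
PINNED = "1.11.0"

# Strip any local build suffix like "+cu113" before comparing versions.
installed = version.parse(torch.__version__.split("+")[0])

# Fail fast if this interpreter is running a newer PyTorch than we validated.
if installed > version.parse(PINNED):
    raise RuntimeError(
        f"Expected torch <= {PINNED} for this training job, found {torch.__version__}. "
        "Rebuild the environment with the pinned version."
    )

print(f"Running torch {torch.__version__} (pinned at {PINNED})")
```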

In the end, we’ll have to wait and see what happens with this slowdown. But one thing is for sure: it’s not going to make us any less frustrated! In the meantime, let’s hope that the PyTorch team can figure out a way to speed things up again soon. Because if they don’t, we might just have to start looking at other options…
