For this example, let’s say we have a text dataset that needs to be classified as either positive or negative feedback.

2. Next, create an Estimator using the Hugging Face DLCs and specify your input data location (e.g., an S3 bucket) and output data location (e.g., another S3 bucket). The Estimator handles all the heavy lifting for us, including downloading the necessary models and tokenizers from Hugging Face, setting up our training environment on AWS, and running our fine-tuning script (see the training sketch below).
3. Start a SageMaker training job using the Estimator’s fit function in Python. This kicks off the actual training process, which can take anywhere from a few hours to several days depending on the size of your dataset and the complexity of your model.
4. Once training is complete, use the Hugging Face Inference Toolkit for SageMaker to deploy your fine-tuned model as an inference endpoint on Amazon SageMaker. This lets us serve predictions from our trained model without having to manage any infrastructure ourselves (see the deployment sketch below).
5. Finally, test your new inference endpoint by sending it some sample data and inspecting the results. If everything is working correctly, we should see accurate classifications for both positive and negative feedback!

By using Hugging Face DLCs with SageMaker’s automatic model tuning feature, we can optimize our training hyperparameters and increase the accuracy of our models. Additionally, by deploying our trained models in Amazon SageMaker, we benefit from cost-effective pricing (training instances are only live for the duration of your job), built-in automation (SageMaker automatically stores training metadata and logs in a serverless managed metastore and fully manages I/O operations with S3 for your datasets, checkpoints, and model artifacts), multiple security mechanisms (encryption at rest and in transit, Virtual Private Cloud connectivity, and Identity and Access Management to secure your data and code), and easy tracking and comparison of experiments and training artifacts in SageMaker Studio’s web-based integrated development environment.
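To make steps 2 and 3 concrete, here is a minimal sketch of creating a Hugging Face Estimator and starting a training job with fit. The IAM role, S3 paths, entry-point script name, hyperparameters, and DLC version strings below are assumptions for illustration; adjust them to your own account and to a currently supported container version.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

# Placeholder role and S3 locations -- replace with your own.
role = sagemaker.get_execution_role()
train_input = "s3://my-bucket/feedback/train"    # input data location (assumption)
output_path = "s3://my-bucket/feedback/output"   # output data location (assumption)

# The Estimator wraps the Hugging Face DLC: it pulls the container,
# provisions the training instance, and runs our fine-tuning script.
huggingface_estimator = HuggingFace(
    entry_point="train.py",          # hypothetical fine-tuning script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.26",     # version strings are assumptions; pick a supported combination
    pytorch_version="1.13",
    py_version="py39",
    output_path=output_path,
    hyperparameters={
        "epochs": 3,
        "train_batch_size": 32,
        "model_name": "distilbert-base-uncased",
    },
)

# Step 3: kick off the SageMaker training job.
huggingface_estimator.fit({"train": train_input})
```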
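Continuing the same sketch, steps 4 and 5 would look roughly like the following: deploy the fine-tuned model as a real-time endpoint and send it a couple of sample reviews. The instance type and sample inputs are illustrative, and the exact output format depends on how the model was fine-tuned.

```python
# Step 4: deploy the fine-tuned model as a SageMaker inference endpoint.
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",   # illustrative choice for a small classifier
)

# Step 5: send sample feedback and inspect the predictions.
print(predictor.predict({"inputs": "The product arrived quickly and works perfectly."}))
print(predictor.predict({"inputs": "Terrible experience, the item broke after one day."}))
# Expected shape: [{"label": ..., "score": ...}] -- labels depend on your fine-tuning setup.

# Clean up the endpoint when finished to stop incurring charges.
predictor.delete_endpoint()
```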
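The automatic model tuning mentioned above can be layered on top of the same Estimator. Below is a hedged sketch using SageMaker’s HyperparameterTuner; the objective metric name, log regex, and search ranges are assumptions and must match whatever your training script actually logs.

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Search space -- ranges here are illustrative, not recommendations.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 5e-5),
    "train_batch_size": IntegerParameter(16, 64),
}

tuner = HyperparameterTuner(
    estimator=huggingface_estimator,
    objective_metric_name="eval_accuracy",   # must match a metric your script logs
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        # The regex is an assumption about the training script's log format.
        {"Name": "eval_accuracy", "Regex": "eval_accuracy.*?=\\s*([0-9\\.]+)"}
    ],
    max_jobs=8,
    max_parallel_jobs=2,
)

# Each tuning trial runs as its own SageMaker training job.
tuner.fit({"train": train_input})
```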
In terms of speedups, the Flash Attention 2 version of the model, benchmarked at two different sequence lengths, delivers significant improvements in pure inference time compared to the native implementation in transformers using the facebook/opt-350m checkpoint. This is because Flash Attention 2 uses a more efficient algorithm for computing attention scores, which results in faster execution times and lower memory usage.

For example, according to the expected speedup diagram provided earlier, we can expect up to a 4x improvement in pure inference time when using the Flash Attention 2 version of the model with a sequence length of 1024, compared to the native transformers implementation with the facebook/opt-350m checkpoint.
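As a rough sketch of what this looks like in code, the snippet below loads the facebook/opt-350m checkpoint with Flash Attention 2 enabled through transformers’ attn_implementation argument. It assumes a recent transformers release, the flash-attn package, and a supported CUDA GPU, since Flash Attention 2 only runs in half precision on GPU.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Flash Attention 2 requires a CUDA GPU, the flash-attn package,
# and half-precision (fp16/bf16) weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

prompt = "The new update to the app is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```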
Overall, by leveraging Hugging Face DLCs with SageMaker’s automatic model tuning feature, we can optimize our training hyperparameters and increase the accuracy of our models, while also achieving significant speedups in pure inference time with Flash Attention 2.