AWS Machine Learning: Understanding Data Sources for Machine Learning

AWS has a wide variety of data sources, and understanding the ins and outs of each one is vital to execute your machine learning projects successfully. Failing to understand the impact on training time, project efficiency, and cost of each data source can lead to poorly optimized solutions and wasted resources.

In this blog post, we will explore four of the most common data sources: Amazon S3, Amazon EFS, Amazon EBS, and Amazon FSx for Lustre. Each of them has its own performance characteristics, which we will discuss in detail. We will rank the data sources by access latency, from slowest to fastest. Then I will share my insights based on my experience architecting machine learning projects.

Importance of Choosing the Right Data Source

One of the first decisions when you start working on a machine learning project in AWS is choosing a data source that matches the performance and efficiency requirements of that specific use case. The main things to consider are the throughput needed, the latency, scalability in terms of data volume, and the type of data (the decision can be quite different if you are training on really big images rather than on small text files, as when training an LLM, for example).

In most machine learning projects you need to parallelize the training across many nodes and there’s a high volume of data, so you want to make sure the training takes as little time as possible because you’re using expensive compute instances.
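To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The function name, the dataset size, the throughput figures, and the hourly price are all made-up illustrative assumptions, not AWS numbers. It only shows how storage throughput translates into time (and money) spent waiting for data.

```python
def data_load_cost(dataset_gb, throughput_gb_per_s, instances, price_per_instance_hour):
    """Estimate the time to read the dataset once and what the cluster costs meanwhile.

    Simplifying assumptions: every instance waits while the data is read, and
    the throughput figure is the aggregate for the whole cluster.
    """
    seconds = dataset_gb / throughput_gb_per_s
    hours = seconds / 3600
    cost = hours * instances * price_per_instance_hour
    return seconds, cost


# Hypothetical numbers: 2 TB dataset, 8 instances at $30 per hour each.
for label, throughput in [("slow store", 0.5), ("fast store", 10.0)]:
    seconds, cost = data_load_cost(2000, throughput, 8, 30.0)
    print(f"{label}: {seconds / 60:.1f} min per pass, ~${cost:.2f} of compute")
```

Real training jobs overlap data loading with computation, so treat this as an upper bound on the waiting time rather than a prediction.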

Overview of AWS Data Sources

There are a lot of different types of data stores that you might have to interact with during a Machine Learning project in AWS. Once you reach the point in the project where you have to decide what data source you’re going to use, the primary consideration is the volume of data and the performance the data store can provide. I will now explain the four most common examples ranked by the average baseline performance in my own experience.

Amazon S3

Amazon S3 is one of the main building blocks of AWS, and it’s infinitely scalable. Since this blog post is focused on Machine Learning, one of the most prominent use-cases is using S3 as a Data Lake, but it’s also common to use it as the source for machine learning training where there are other data pipelines dropping data into the S3 bucket.

In terms of performance, it scales to really high request rates while keeping latency around 100-200 ms. You can always issue requests in parallel, keeping in mind that the request rate is counted per prefix: S3 supports at least 5,500 GET requests per second per prefix, so to go beyond that you need to spread the objects across different prefixes.
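As a minimal sketch of the per-prefix arithmetic, the snippet below estimates how many prefixes a target GET rate needs and assigns each file to a stable, hash-derived prefix. The function names, the "train/shard-NN/" layout, and the 20,000 requests per second target are assumptions made for illustration. This is one possible way to spread keys, not an S3 feature.

```python
import hashlib
import math

GET_RPS_PER_PREFIX = 5500  # per-prefix GET rate quoted above


def prefixes_needed(target_get_rps, per_prefix_rps=GET_RPS_PER_PREFIX):
    """Smallest number of prefixes whose combined rate covers the target."""
    return math.ceil(target_get_rps / per_prefix_rps)


def sharded_key(filename, num_prefixes, root="train"):
    """Place a file under a stable, hash-derived prefix such as train/shard-03/."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_prefixes
    return f"{root}/shard-{shard:02d}/{filename}"


shards = prefixes_needed(20000)  # hypothetical target of 20,000 GETs per second
print(shards)
print(sharded_key("img_000123.jpg", shards))
```

Hashing the file name keeps the mapping deterministic, so the same file always lands under the same prefix without any lookup table.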

S3 also has a really low cost for large volumes of data or infrequent access, which makes it a really cost-effective solution, and one that I believe should be the default for any company starting its machine learning journey.

There’s also a newer storage class called Amazon S3 Express One Zone, where the data is stored in a single AZ, which delivers up to 10x faster access and lower request costs. If you’re, for example, using SageMaker and starting up a cluster in one specific AZ, you can create a bucket in that same zone and take advantage of this feature.

TIP

The default input mode for SageMaker training is “File” mode, which downloads the training data to a local directory before training starts. You should consider using Fast File mode, which provides file system access to S3 while still leveraging the performance advantage of Pipe mode (which streams the data).
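As a minimal sketch of what this looks like in practice, the helper below builds one entry of the InputDataConfig list for a SageMaker CreateTrainingJob request with the input mode set to FastFile. The helper name, the channel name, and the bucket URI are placeholders I made up. The dictionary keys follow the CreateTrainingJob request shape as I understand it, so check them against the API reference before relying on them.

```python
def s3_channel(name, s3_uri, input_mode="FastFile"):
    """Build one InputDataConfig entry that reads training data from S3."""
    if input_mode not in {"File", "FastFile", "Pipe"}:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": name,
        "InputMode": input_mode,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical bucket and prefix.
train_channel = s3_channel("train", "s3://example-ml-bucket/datasets/train/")
print(train_channel["InputMode"])
```

The resulting dictionary would be passed as InputDataConfig=[train_channel] to the create_training_job call of the boto3 SageMaker client, together with the other required arguments (algorithm, role, output location, and resources).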

Amazon Elastic File System

Amazon EFS is a network file system that can be accessed at the same time by many EC2 instances or SageMaker instances. By default it is highly available (deployed across multiple AZs) and scalable. It can scale to petabytes of data and provide more than 10 GB/s of throughput.

I have seen EFS used in many machine learning workflows where the team needed shared data access, and it was convenient to have the training data for the different projects in the same network drive.

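Something to consider is that EFS is not that cost-effective for high throughput, and the latency can be quite variable. The performance mode can only be chosen when creating the file system: Max I/O raises aggregate throughput and IOPS at the cost of somewhat higher per-operation latency, while guaranteed throughput means paying a premium for Provisioned Throughput mode.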

Amazon Elastic Block Store

Amazon EBS is a block device that lives in the AWS network and that you can attach to your instances to persist data. An EBS volume is tied to one specific Availability Zone, so, as with S3 Express One Zone, you need to ensure that the SageMaker or ML jobs run in the same AZ.

If you’re using the newer-generation gp3 volumes, you can increase IOPS and throughput independently of the volume size. It has excellent performance and latency, so it’s not a bad option if you need to store your training data with low latency. A big downside to keep in mind is that you need to manage creating the volume, formatting it, and attaching it to your training instances, so it has higher operational overhead.
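To illustrate the independent settings, here is a small sketch that builds the arguments for an EC2 CreateVolume request for a gp3 volume. The helper name, the Availability Zone, and the sizes are illustrative assumptions. The defaults of 3,000 IOPS and 125 MiB/s are the gp3 baseline as I understand it.

```python
def gp3_volume_request(availability_zone, size_gib, iops=3000, throughput_mib_s=125):
    """Arguments for an EC2 CreateVolume call; the defaults are the gp3 baseline."""
    return {
        "AvailabilityZone": availability_zone,  # must match the training instance's AZ
        "VolumeType": "gp3",
        "Size": size_gib,                # GiB
        "Iops": iops,                    # provisioned separately from size
        "Throughput": throughput_mib_s,  # MiB/s, also separate from size
    }


# Hypothetical volume: 500 GiB with extra IOPS and throughput.
request = gp3_volume_request("us-east-1a", size_gib=500, iops=6000, throughput_mib_s=500)
print(request)
```

With boto3 this dictionary could be passed as boto3.client("ec2").create_volume(**request). Attaching, formatting, and mounting the volume are still separate steps, which is the operational overhead mentioned above.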

Keep in mind that EBS is not supported by default as a data source on SageMaker. You can use it on your own EC2 instances, but it doesn’t really integrate well with other ML services, so I would only use it for experimenting with a low data volume that also needs low latency.

Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed distributed file system that was designed for really large-scale computing. It can scale to more than 100 GB/s of throughput and millions of IOPS, all while keeping stable, sub-millisecond latencies. You can choose SSD storage (for low-latency or small-file operations) or HDD storage (for large, sequential workloads).

The biggest advantage, and one that is not widely known, is that it integrates seamlessly with S3. You can use S3 as your data lake and then create an FSx for Lustre file system that sits “in front” of it and caches the objects. It can also write any output files back to S3.

If you have a large dataset, need really high performance, and already have the data in an S3 data lake, don’t think twice about choosing Amazon FSx for Lustre as your data source.
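On the training side, a file system like this can be handed to a SageMaker training job as an input channel. The sketch below builds such a channel as a plain dictionary. The helper name, the file system ID, and the directory path are placeholders, and the keys follow the CreateTrainingJob request shape as I understand it, so verify them against the API reference.

```python
def fsx_lustre_channel(name, file_system_id, directory_path, read_only=True):
    """Build one InputDataConfig entry backed by an FSx for Lustre file system."""
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro" if read_only else "rw",
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder ID and path; the real path starts with the file system's mount name.
channel = fsx_lustre_channel("train", "fs-0123456789abcdef0", "/examplemount/train")
print(channel["DataSource"]["FileSystemDataSource"]["FileSystemType"])
```

As far as I know, the training job also has to run inside a VPC, with subnets and security groups that can reach the file system, so the network configuration needs to be supplied alongside this channel.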

Real-world example: Training a classification model using S3 with SageMaker

I worked on a project using SageMaker Autopilot to train a binary classification model where we used S3 as the backend data store for training. While I can’t disclose the project, a binary classification model like this can be used, for example, in medical diagnosis to determine whether a person has a specific disease or not. For example, the dataset could contain X-ray images like those in the NIH Chest X-rays dataset.

Using S3 for this project allowed us to quickly parallelize the training across many jobs, while SageMaker Autopilot automatically inferred the best type of model for our use case and evaluated all the different hyperparameters and algorithms to pick the best one. We didn’t have to worry about scalability or performance in this case. It was really quick to set up (within a few hours), and the predictions reached the target accuracy.

Conclusion

In this post we covered some of the main options for data sources you might use in machine learning projects in AWS. It is a difficult decision where you have to balance performance, data volume, and cost, always keeping in mind future scalability, flexibility, and operational overhead.

In the projects that I’ve been involved in, S3 is in most cases the default option, as it is often already the company’s data lake solution. EFS is used in cases where the team collaborates in a single file system, and EBS for specific cases. Teams that are training big models use FSx for Lustre tied into the S3 data lake, and take advantage of the high performance and low operational overhead of the solution.

If you have any questions or comments, please feel free to reach out to me on the contact form or schedule a call with me.