S3: Getting data in and out
Manage your S3 data transfers efficiently: ways to copy and sync data from various sources using the AWS CLI and DataSync.
S3, the object store from AWS, offers practically unlimited, extremely reliable and cost-effective storage for a vast array of assets, making it a core AWS service.
But how do you get your data in and out of S3 efficiently? We’ll look at some tools and techniques you can use to seamlessly transfer, synchronize, and keep your S3 buckets up-to-date.
The S3 CLI
The command line interface gives us two main ways to move data between S3 and a destination: copy and sync.
Other commands, such as `mv` (move) and `rm` (remove), follow a similar pattern.
The full command reference is available in the AWS CLI documentation.
Copy vs Sync
The `cp` (copy) command is straightforward: it copies files from a source to a destination on a one-off basis. This is ideal when you need to transfer specific files, or when the source and destination are not expected to stay in sync.
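For example, a one-off copy of a single file up to a bucket and back down again (the bucket name and paths here are placeholders):

```bash
# One-off upload of a single file to a bucket...
aws s3 cp ./report.pdf s3://your-bucket-name/reports/report.pdf

# ...and a one-off download back to the local filesystem.
aws s3 cp s3://your-bucket-name/reports/report.pdf ./report.pdf
```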
The `sync` command, however, updates a destination to match a source by copying only the files that have changed. This reduces transfer time and cost by avoiding unnecessary data movement.
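A minimal sketch, again with placeholder names; only new or changed files under ./build are uploaded:

```bash
# Upload only the files that are new or have changed since the last sync.
aws s3 sync ./build s3://your-bucket-name
```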
Recursion:
A key distinction between `cp` and `sync` is recursion. The `sync` command is recursive by default; to achieve the same with `cp`, you must pass the `--recursive` flag.
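Side by side, with placeholder paths:

```bash
# cp needs --recursive to copy a whole directory tree...
aws s3 cp ./assets s3://your-bucket-name/assets --recursive

# ...whereas sync walks the tree by default.
aws s3 sync ./assets s3://your-bucket-name/assets
```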
Including and Excluding Files
Both the `cp` and `sync` commands support options to include or exclude files, using the `--include` and `--exclude` flags. You can specify full filenames or glob patterns, and supply the flags multiple times; filters are applied in order, with later ones taking precedence.
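For instance, to upload only images from a placeholder media directory, exclude everything first and then re-include the patterns you want:

```bash
# Filters are evaluated in order: exclude everything, then re-include images.
aws s3 cp ./media s3://your-bucket-name/media --recursive \
    --exclude "*" --include "*.jpg" --include "*.png"
```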
Transferring Metadata
This is particularly useful when your files are served by CloudFront: a variety of options let you set metadata on the S3 object. System metadata such as the content type and encoding are set with dedicated flags like `--content-type` and `--content-encoding`, while the `--metadata` option attaches custom user-defined metadata. These values are then sent as headers along with the objects when CloudFront requests them, so assigning them correctly during transfer matters.
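A sketch with placeholder file names and values; the dedicated flags set the system metadata, while `--metadata` attaches a custom key-value pair:

```bash
# Upload a pre-gzipped stylesheet with the headers CloudFront should serve,
# plus a custom x-amz-meta-build tag (placeholder value).
aws s3 cp ./styles.css.gz s3://your-bucket-name/styles.css \
    --content-type "text/css" \
    --content-encoding "gzip" \
    --metadata build=20240115
```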
Delete
If you’re syncing your latest build, you may want to remove any files at the destination that are no longer present in the source. The `--delete` flag on the `sync` command does exactly this, ensuring that your destination bucket only contains files that match the current source.
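For example, to mirror a build directory exactly (placeholder names again):

```bash
# Upload changes and remove remote files that no longer exist locally.
aws s3 sync ./build s3://your-bucket-name --delete
```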
Dry-Run
Before executing a potentially large or disruptive operation, it’s wise to use the `--dryrun` option. This flag shows what changes would be made without actually applying them, providing a safety net while fine-tuning commands.
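For example, previewing a destructive sync before committing to it:

```bash
# Print the copy/delete operations that would happen, without doing them.
aws s3 sync ./build s3://your-bucket-name --delete --dryrun
```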
Other Parameters for aws s3 sync
The `aws s3 sync` command offers a variety of other parameters for fine-grained control over the synchronization process; a short example follows this list. Some options of particular interest include:
- `--size-only`: compare files by size alone, ignoring timestamps. This avoids re-uploading files whose timestamps have changed but whose contents have not.
- `--exact-timestamps`: when syncing from S3 to the local filesystem, skip same-sized items only when their timestamps match exactly, rather than the default of skipping same-sized items unless the local version is newer.
- `--follow-symlinks` / `--no-follow-symlinks`: control whether symbolic links in the local source are followed when uploading. Links are followed by default; the link itself is never uploaded as a link.
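A quick sketch combining two of these (placeholder paths):

```bash
# Compare files by size only, and don't follow local symbolic links.
aws s3 sync ./data s3://your-bucket-name/data \
    --size-only --no-follow-symlinks
```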
Moving Data Between Buckets
In the Same Account
It’s straightforward to sync between buckets in the same AWS account using the CLI: just use an S3 URI (`s3://bucket-name/prefix`) for both the source and the destination.
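For example, with placeholder bucket names:

```bash
# Sync one bucket (or a prefix within it) to another in the same account.
aws s3 sync s3://source-bucket/assets s3://destination-bucket/assets
```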
Across Accounts
Syncing between buckets located in different accounts is also possible, though slightly more complex. Assuming the CLI is authenticated against account A, the bucket in account B will require a bucket policy granting access to the calling principal (user or role) from account A.
The bucket policy and requirements are covered in depth in the AWS docs.
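As a rough sketch (the account ID, role name and bucket name are all placeholders), account B could grant account A’s identity access with a bucket policy like this, applied via `aws s3api put-bucket-policy`:

```bash
# Hypothetical bucket policy on account B's bucket, granting list/read/write
# to a role in account A. All ARNs and names are placeholders.
cat > cross-account-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111111111111:role/sync-role"},
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::account-b-bucket"
    },
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111111111111:role/sync-role"},
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::account-b-bucket/*"
    }
  ]
}
EOF

aws s3api put-bucket-policy --bucket account-b-bucket \
    --policy file://cross-account-policy.json
```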
Simulating rsync
While not an exact replica, `aws s3 sync` can efficiently synchronize directories between a source and destination, copying only the files that have changed or been added.
To achieve rsync-like behavior, combine the `--delete`, `--include` and `--exclude` flags discussed above for similar control over what gets transferred.
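For example (paths and bucket name are placeholders):

```bash
# rsync-like mirror: copy only .txt files and delete stale remote files.
aws s3 sync /path/to/local/directory s3://your-bucket-name \
    --exclude "*" --include "*.txt" --delete
```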
This command will synchronize the contents of /path/to/local/directory with the S3 bucket your-bucket-name. By excluding everything and then re-including `*.txt`, it copies only .txt files (so .log files, among others, are left out), and `--delete` removes remote files that no longer exist locally, mimicking rsync behavior.
Syncing with Other Sources
The CLI offers a simple way to sync between an S3 bucket and either another bucket or a local filesystem.
A more flexible option is AWS DataSync. This service is designed to simplify moving large volumes of data between AWS services, or from on-premises storage systems into AWS.
You configure DataSync with a source and a destination, but the options extend beyond your filesystem or S3: you can choose other AWS storage services like EFS or FSx, or even on-premises drives and storage solutions from other cloud providers. DataSync handles the scheduling and automation of the data transfer process, making it an excellent choice for regular, resource-intensive, high-bandwidth data migration tasks.
DataSync task executions can be scheduled, for example, using EventBridge Scheduler.
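One way to wire this up (the ARNs below are placeholders, and the target uses Scheduler’s generic AWS SDK integration to call DataSync’s StartTaskExecution API) might look like this:

```bash
# Start a DataSync task every night at 02:00 UTC via EventBridge Scheduler.
# The task ARN and execution role ARN are placeholders.
aws scheduler create-schedule \
    --name nightly-datasync \
    --schedule-expression "cron(0 2 * * ? *)" \
    --flexible-time-window Mode=OFF \
    --target '{
        "Arn": "arn:aws:scheduler:::aws-sdk:datasync:startTaskExecution",
        "RoleArn": "arn:aws:iam::111111111111:role/scheduler-datasync-role",
        "Input": "{\"TaskArn\": \"arn:aws:datasync:eu-west-1:111111111111:task/task-0123456789abcdef0\"}"
    }'
```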
Mounting S3 as a filesystem
Mounting S3 as a filesystem, although attractive for simplifying access, has traditionally not been recommended due to performance considerations and limitations. However, there are options available:
Third-party tools like `s3fs-fuse` have been around for a while and offer a way to mount S3 as a local filesystem. This can be useful for specific use cases, but be aware of the potential performance implications.
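A minimal sketch, assuming s3fs is installed and the bucket name and mount point are placeholders:

```bash
# Store credentials where s3fs expects them, then mount the bucket.
echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > ~/.passwd-s3fs
chmod 600 ~/.passwd-s3fs
s3fs your-bucket-name /mnt/s3 -o passwd_file=~/.passwd-s3fs
```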
The introduction of Mountpoint for Amazon S3 in March 2023 is promising: it provides a native solution from AWS for mounting S3 on Linux, and suggests that mounting S3 as a filesystem might become a more mainstream approach in the future.
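With Mountpoint installed, mounting is a one-liner; the bucket name and mount point are placeholders, and credentials come from the standard AWS credential chain:

```bash
# Mount the bucket, then read its objects with ordinary POSIX tools.
mkdir -p /mnt/s3
mount-s3 your-bucket-name /mnt/s3
ls /mnt/s3
```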
Security Considerations
However you decide to move your data, security is paramount!
Here are some key considerations to keep in mind when transferring data with S3.
- IAM Permissions: Ensure the IAM user or role executing the aws s3 sync command has the appropriate permissions to access both the source and destination buckets. This includes read access for the source and write access for the destination.
- Least Privilege: Follow the principle of least privilege and grant only the minimum permissions required for the sync operation (see the policy sketch after this list). Avoid using IAM users with overly broad permissions for syncing tasks.
- Encryption: Consider encrypting data at rest and in transit for added security. You can use AWS Key Management Service (KMS) to manage encryption keys.
- Versioning: Enable versioning on your S3 buckets, especially for critical data. This allows you to recover previous versions of objects in case of accidental deletion or modification during syncing.
- Audit Logging: Enable CloudTrail logging for your S3 buckets to track API calls, including sync operations. This provides an audit trail for troubleshooting and security purposes.
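As a sketch of least privilege for a one-way `aws s3 sync --delete` to a destination bucket (the user name, policy name and bucket are placeholders):

```bash
# Hypothetical minimal policy for syncing to s3://your-bucket-name
# with --delete; attach it to the user or role that runs the sync.
cat > sync-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-bucket-name"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
EOF

aws iam put-user-policy --user-name sync-user \
    --policy-name s3-sync-minimal --policy-document file://sync-policy.json
```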
By following these security best practices, you can minimize the risk of unauthorized access or data breaches during S3 data synchronization.
Conclusion
AWS offers a powerful toolkit for managing data movement to and from S3 buckets. Here’s a quick recap:
- Simple transfers: Use the `aws s3 cp` command for one-time file transfers.
- Keeping things in sync: Leverage the `aws s3 sync` command to efficiently update a destination bucket to match a source.
- Large-scale data migration: Utilize AWS DataSync for automated and efficient data transfers between various sources and destinations, including S3 buckets.
- Future considerations: Keep an eye on the development of Mountpoint for Amazon S3, which might simplify S3 access through filesystem mounting.
- Keep it safe: Whatever method you choose, keep it as secure as possible.
By understanding these options and their use cases, you can choose the most appropriate method for managing your data in S3.
About James Babington
A cloud architect and engineer with a wealth of experience across AWS, web development, and security, James enjoys writing about the technical challenges and solutions he's encountered, but most of all he loves it when a plan comes together and it all just works.