Is it possible to perform a batch upload to Amazon S3?

Alternatively, you can upload to S3 via the AWS CLI tool using the sync command.

aws s3 sync local_folder s3://bucket-name

You can use this method to batch-upload files to S3 quickly, and because sync only copies files that are new or changed, repeated runs are cheap.


To add on to what everyone is saying: if you want your Java code (instead of the CLI) to do this without having to put all of the files in a single directory, you can create a list of files to upload and then supply that list to the AWS TransferManager's uploadFileList method.

https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html#uploadFileList-java.lang.String-java.lang.String-java.io.File-java.util.List-
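
Here is a minimal sketch of that approach, assuming the AWS SDK for Java v1; the bucket name, key prefix, and file paths are placeholders:

import com.amazonaws.services.s3.transfer.MultipleFileUpload;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

import java.io.File;
import java.util.Arrays;
import java.util.List;

public class BatchUpload {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder bucket and files; the files do not have to share a directory.
        String bucket = "my-bucket";
        List<File> files = Arrays.asList(
                new File("/data/a/report-1.html"),
                new File("/data/b/report-2.html"));

        TransferManager tm = TransferManagerBuilder.standard().build();
        try {
            // uploadFileList(bucket, keyPrefix, baseDirectory, files):
            // each object key is the file's path relative to baseDirectory,
            // prepended with keyPrefix.
            MultipleFileUpload upload =
                    tm.uploadFileList(bucket, "uploads", new File("/data"), files);
            upload.waitForCompletion(); // blocks until every file has uploaded
        } finally {
            tm.shutdownNow(); // also shuts down the underlying S3 client
        }
    }
}

TransferManager performs the individual uploads concurrently on an internal thread pool, so this is the in-process equivalent of the concurrent CLI approach described below.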


Survey

Is it possible to perform a batch upload to Amazon S3?

Yes*.

Does the S3 API support uploading multiple objects in a single HTTP call?

No.

Explanation

The Amazon S3 API doesn't support bulk upload, but awscli supports concurrent (parallel) uploads. From the client's perspective and in terms of bandwidth efficiency, these options should perform roughly the same way; the diagram below compares the three request patterns, and a code sketch of the concurrent approach follows it.

 ────────────────────── time ────────────────────►

    1. Serial
 ------------------
   POST /resource
 ────────────────► POST /resource
   payload_1     └───────────────► POST /resource
                   payload_2     └───────────────►
                                   payload_3
    2. Bulk
 ------------------
   POST /bulk
 ┌────────────┐
 │resources:  │
 │- payload_1 │
 │- payload_2 ├──►
 │- payload_3 │
 └────────────┘

    3. Concurrent
 ------------------
   POST /resource
 ────────────────►
   payload_1

   POST /resource
 ────────────────►
   payload_2

   POST /resource
 ────────────────►
   payload_3
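
To make option 3 concrete, here is a rough Java sketch (AWS SDK for Java v1; the bucket name, file names, and pool size are placeholders) that issues one single-object PUT per file from a fixed-size thread pool, which is conceptually what awscli does:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentUpload {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        List<File> payloads = Arrays.asList(
                new File("payload_1"), new File("payload_2"), new File("payload_3"));

        // Pool size of 10 mirrors the awscli default for max_concurrent_requests.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (File f : payloads) {
            // Each task is an independent single-object PUT request.
            pool.submit(() -> s3.putObject("my-bucket", f.getName(), f));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for all PUTs to finish
    }
}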

AWS Command Line Interface

The documentation on How can I improve the transfer performance of the sync command for Amazon S3? suggests increasing concurrency in two ways. One of them is this:

To potentially improve performance, you can modify the value of max_concurrent_requests. This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10, and you can increase it to a higher value. However, note the following:

  • Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
  • Too many concurrent requests can overwhelm a system, which might cause connection timeouts or slow the responsiveness of the system. To avoid timeout issues from the AWS CLI, you can try setting the --cli-read-timeout value or the --cli-connect-timeout value to 0.

A script setting max_concurrent_requests and uploading a directory can look like this:

aws configure set s3.max_concurrent_requests 64
aws s3 cp local_path_from s3://remote_path_to --recursive
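
Note that aws configure set persists this value in ~/.aws/config, so you can equally set it there by hand; assuming the default profile, the resulting section looks like this:

[default]
s3 =
    max_concurrent_requests = 64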

To give an idea of how running more threads consumes more resources, I did a small measurement in a container running aws-cli (using procpath), uploading a directory with ~550 HTML files (~40 MiB in total, average file size ~72 KiB) to S3. The following chart shows the CPU usage, RSS, and thread count of the uploading aws process.

[Chart: aws s3 cp --recursive, max_concurrent_requests=64: CPU usage, RSS, and thread count of the aws process over time]


Does the S3 API support uploading multiple objects in a single HTTP call?

No, the S3 PUT operation supports uploading only one object per HTTP request.

You could install S3 Tools (s3cmd) on the machine that you want to synchronize with the remote bucket, and run the following command:

s3cmd sync localdirectory s3://bucket/

Then you could place this command in a script and create a scheduled job to run it each night.
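
For example, a crontab entry along these lines (the local path is a placeholder) would run the sync at 2 AM every night:

0 2 * * * s3cmd sync /path/to/localdirectory s3://bucket/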

This should do what you want.

The tool performs the file synchronization based on MD5 hashes and file size, so collisions should be rare (if you really want, you can use the "s3cmd put" command to force blind overwriting of objects in your target bucket).

EDIT: Also make sure you read the documentation on the S3 Tools site I linked: different flags are needed depending on whether you want files deleted locally to also be deleted from the bucket, ignored, and so on.