Spark in AWS: "S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream"

Ignore it. Recent versions of the AWS SDK always tell you off when you call abort() on the input stream, even though that is exactly what you need to do when seeking around a many-GB file. For small files, yes, draining the stream to EOF is the right thing to do; for big files, no.
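The tradeoff can be sketched as a simple heuristic. This is an illustration only: the class, method, threshold name, and cutoff value here are hypothetical, not the real AWS SDK or S3A internals.

```java
// Sketch of the drain-vs-abort tradeoff described above.
// Hypothetical names and values, not actual SDK/S3A code.
public class DrainOrAbort {

    // If only a few bytes remain, draining to EOF is cheap and lets the
    // HTTP connection be reused. If many GB remain, draining would mean
    // downloading the rest of the object, so aborting is cheaper even
    // though the SDK logs a warning about unread bytes.
    static final long DRAIN_THRESHOLD = 16 * 1024; // hypothetical 16 KB cutoff

    static boolean shouldAbort(long bytesRemaining) {
        return bytesRemaining > DRAIN_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(shouldAbort(100));                     // prints false: drain the tail
        System.out.println(shouldAbort(5L * 1024 * 1024 * 1024)); // prints true: abort, eat the warning
    }
}
```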

See: SDK repeatedly complaining "Not all bytes were read from the S3ObjectInputStream"

If you see this a lot, and you are working with columnar data formats such as ORC and Parquet, switch the input stream from sequential to random IO by setting the property fs.s3a.experimental.fadvise to random. This stops it from ever trying to read the whole file; instead it reads only small blocks. Very bad for full-file reads (including .gz files, which have to be read from start to finish), but it transforms column IO.
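For a Spark job, one way to set this is in spark-defaults.conf, using Spark's spark.hadoop. prefix to pass the option through to the Hadoop filesystem configuration (a sketch; set it wherever your deployment manages Spark/Hadoop config):

```
spark.hadoop.fs.s3a.experimental.fadvise  random
```

Remember this applies per-filesystem-instance: it will slow down any sequential whole-file reads done through the same configuration.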

Note: there's a small fix in S3A for Hadoop 3.x on the final close, HADOOP-14596. It's up to the EMR team whether or not to backport it.

I'll add some text to the S3A troubleshooting docs. The ASF have never shipped a Hadoop release with this problem, but if people are playing mix-and-match with the AWS SDK (very brittle), then it may surface.