How do I create a Spark dataframe from an S3 pre-signed URL using Databricks? | by Ganesh Chandrasekaran | February 2022

A pre-signed URL grants temporary access to a specific S3 object. It is widely used when the data owner does not want to grant bucket- or folder-level access to a large number of downstream users.

Here is an illustrative example of the shape of a pre-signed URL, built from the parameter values described below (the signature value is omitted):
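    https://gcbucket.s3.amazonaws.com/friends.csv.gz?AWSAccessKeyId=AKRDWVY7VZTAUN42OMQ&Expires=1643776416&Signature=...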

A pre-signed URL uses three parameters to limit the user's access:

Bucket: gcbucket

Key: AKRDWVY7VZTAUN42OMQ

Expires: 1643776416

Anyone with a valid pre-signed URL can interact with the object only as specified when the URL was created. For example, a URL pre-signed for GET (read) cannot be used for PUT (write).

The sample script is written using PySpark in Databricks.

Step 1: Using the requests library, download the contents of the S3 object. In the example above, it is friends.csv.gz.
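A minimal sketch of this step, assuming the requests library is available on the cluster and using a placeholder in place of the real pre-signed URL:

    import requests

    # Placeholder pre-signed URL; substitute the one issued for your object
    presigned_url = "https://gcbucket.s3.amazonaws.com/friends.csv.gz?AWSAccessKeyId=...&Expires=...&Signature=..."

    # A pre-signed GET URL is a plain HTTP endpoint, so a simple GET fetches the object
    response = requests.get(presigned_url)
    response.raise_for_status()  # fail fast if the URL is expired or malformed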

Step 2: Save the content to a temporary folder on the driver node. In this example, it is stored under /tmp on the driver.
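Continuing from Step 1, the response body can be written to local disk on the driver (the /tmp path and file name follow this example):

    # Save the gzipped CSV to a temporary file on the driver node
    local_path = "/tmp/friends.csv.gz"
    with open(local_path, "wb") as f:
        f.write(response.content)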

Step 3: Verify the download.
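One simple check is that the file exists on the driver's local disk and is non-empty, for example:

    import os

    # Confirm the file landed on the driver node
    print(os.path.exists("/tmp/friends.csv.gz"))
    print(os.path.getsize("/tmp/friends.csv.gz"), "bytes")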

Step 4: Move the downloaded file into DBFS so that it can be loaded into a Spark DataFrame.
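In a Databricks notebook this can be done with dbutils; a sketch, assuming dbfs:/tmp as the target folder:

    # Move the file from the driver's local filesystem into DBFS
    dbutils.fs.mv("file:/tmp/friends.csv.gz", "dbfs:/tmp/friends.csv.gz")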

Step 5: Create the DataFrame.
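Spark reads gzipped CSV files transparently, so the DataFrame can be created in a single call; the header and inferSchema options are assumptions about friends.csv.gz:

    # Spark decompresses the .gz file automatically when reading CSV
    df = spark.read.csv("dbfs:/tmp/friends.csv.gz", header=True, inferSchema=True)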

Step 6: Display the DataFrame.
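In a Databricks notebook, display() renders the DataFrame as an interactive table; df.show() works in any Spark environment:

    # Render the DataFrame; use df.show() outside Databricks
    display(df)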

Step 7: [Optional] Delete the file from the DBFS folder.
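A sketch of the cleanup, matching the DBFS path used above:

    # Remove the temporary copy from DBFS
    dbutils.fs.rm("dbfs:/tmp/friends.csv.gz")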

Putting it all together
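A consolidated version of the steps above, with the pre-signed URL and file paths as placeholders to be replaced with your own values:

    import requests

    # Placeholder pre-signed URL; replace with the one issued for your object
    presigned_url = "https://gcbucket.s3.amazonaws.com/friends.csv.gz?AWSAccessKeyId=...&Expires=...&Signature=..."
    local_path = "/tmp/friends.csv.gz"
    dbfs_path = "dbfs:/tmp/friends.csv.gz"

    # Steps 1-3: download the object, fail fast on errors, and save it on the driver node
    response = requests.get(presigned_url)
    response.raise_for_status()
    with open(local_path, "wb") as f:
        f.write(response.content)

    # Step 4: move the file into DBFS so Spark can read it
    dbutils.fs.mv(f"file:{local_path}", dbfs_path)

    # Steps 5-6: create the DataFrame and display it
    df = spark.read.csv(dbfs_path, header=True, inferSchema=True)
    display(df)

    # Step 7 (optional): remove the temporary DBFS copy
    dbutils.fs.rm(dbfs_path)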

The AWS S3 pre-signed URL was used here for demonstration purposes only. It can be replaced by any other URL that allows a file to be downloaded via an HTTP GET request.
