Amazon S3 (Simple Storage Service): a great fit for a data lake implementation
Use Cases and Deployment Scope
At my organization, we primarily use Amazon S3 (Simple Storage Service) as a catch-all storage solution. In particular, we have built our entire data lake around it, so we have “raw layer” buckets, “clean layer” buckets, and so on. There are of course other uses, such as simple “data dump” buckets, but in general we aim to use Amazon S3 (Simple Storage Service) as our main storage solution.
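To make the layered layout a bit more concrete, here is a minimal sketch of a naming helper for layer buckets and date-partitioned object keys. The `<org>-datalake-<layer>-<env>` scheme, the `dt=` partition format, and the function names are assumptions for illustration, not our actual convention, and the validation covers only a simplified subset of the S3 bucket naming rules.

```python
import re

# Layer names taken from the review; extend as needed.
LAYERS = ("raw", "clean")

# Simplified subset of the S3 bucket naming rules: 3-63 characters,
# lowercase letters, digits and hyphens, starting and ending with a
# letter or digit. (Real rules also allow dots and add further limits.)
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def bucket_name(org: str, layer: str, env: str = "prod") -> str:
    """Build a bucket name such as 'acme-datalake-raw-prod' (hypothetical scheme)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    name = f"{org}-datalake-{layer}-{env}".lower()
    if not _BUCKET_RE.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


def object_key(dataset: str, date: str, filename: str) -> str:
    """Build a date-partitioned key, e.g. 'orders/dt=2024-01-31/part-0.parquet'."""
    return f"{dataset}/dt={date}/{filename}"
```

Keeping the layer in the bucket name and the dataset and date in the key means that most lookups can start from a known prefix.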
Pros
- Bucket name uniqueness, as it forces us to implement at least a rudimentary naming convention
- Flexibility in bucket management: policies, versioning, etc.
- Available APIs: it is quite easy to interact with Amazon S3 (Simple Storage Service) thanks to the various APIs for reading, writing, and updating objects
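As an illustration of the API point, here is a minimal sketch of listing objects programmatically. It assumes `client` is a boto3 S3 client (for example `boto3.client("s3")`); the function name `find_keys` and its parameters are hypothetical. The listing call filters server-side by key prefix only, so the substring match is applied client-side.

```python
from typing import Iterator


def find_keys(client, bucket: str, prefix: str = "", contains: str = "") -> Iterator[str]:
    """Yield keys under `prefix` whose full key contains `contains`.

    `client` is assumed to behave like a boto3 S3 client.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            if contains in obj["Key"]:
                yield obj["Key"]
```

Usage would look like `list(find_keys(client, "my-bucket", prefix="raw/", contains="orders"))`, where the bucket name is a placeholder. Note that a broad prefix still lists every object under it, so this can be slow and incur request costs on large buckets.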
Cons
- UI: it could be a bit more intuitive, especially when there are deleted elements
- Filtering works on the name prefix only (no partial matches): in a lot of cases, the precise full path and name of the object must be known to find it
- It’s very easy to write overly broad policies or to lock yourself out of a bucket completely; it would be nice to have some guardrails in place
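As an example of the kind of guardrail meant in the last point, here is a rough sketch of a check that could run on a bucket policy document before it is applied. The function name `risky_statements` and the two rules it implements are assumptions chosen for illustration; this is nowhere near a complete policy analysis and does not replace AWS's own tooling.

```python
import json
from typing import List


def _as_list(value):
    """Policy fields may hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def risky_statements(policy_json: str) -> List[str]:
    """Return warnings for two obviously risky patterns in a bucket policy."""
    warnings = []
    policy = json.loads(policy_json)
    for i, stmt in enumerate(_as_list(policy.get("Statement", []))):
        principal = stmt.get("Principal")
        # Principal can be "*" or a mapping such as {"AWS": "*"}.
        anyone = principal == "*" or (
            isinstance(principal, dict) and "*" in _as_list(principal.get("AWS", []))
        )
        all_actions = any(a in ("*", "s3:*") for a in _as_list(stmt.get("Action", [])))
        unconditional = "Condition" not in stmt
        effect = stmt.get("Effect")
        if anyone and unconditional and effect == "Allow":
            warnings.append(f"statement {i}: allows any principal without a condition")
        if anyone and unconditional and all_actions and effect == "Deny":
            warnings.append(f"statement {i}: denies all S3 actions to everyone (lock-out risk)")
    return warnings
```

A check like this could sit in a CI step or a deployment script and block the change whenever the returned list is not empty.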
Return on Investment
- Affordable: the entire data lake and most of our raw data are on Amazon S3 (Simple Storage Service), and it’s not the most expensive AWS service we use
- Easy to onboard to: we are aiming for 100% of data being synced to Amazon S3 (Simple Storage Service) in some form, so that data is located in a single place
- Good integration with other systems, which has reduced both our overall costs and the time it takes us to reach a decision
Other Software Used
Apache Airflow, Apache Spark, Amazon Elastic Kubernetes Service (EKS)


