A de minimis incentive was given to thank the reviewer for their time. The incentive was not used to bias or drive a particular response, nor was the incentive contingent on a positive endorsement.
Verified User
Employee in Engineering (1001-5000 employees)
Use Cases and Deployment Scope
We use Apache Spark daily as the main computation engine for updating most of our data pipelines, critical and non-critical alike. We mostly work with batch processing, but we also use Spark Streaming in some cases. The scope covers all analysis pipelines, machine learning datasets, and several operational use cases.
Pros
Parallel processing
Configurability
Usage with other tools
Cons
More ready-to-use solutions for tuning Apache Spark configuration would help
Reduce the need to write UDFs in PySpark by offering more transformations as built-in functions
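To illustrate the kind of ready-to-use configuration helper the first point asks for, here is a minimal sketch in plain Python. It applies a common community rule of thumb (about five cores per executor, one core and 1 GB per node left for the operating system, roughly 10% of executor memory kept for overhead). The `ClusterShape` class and the `suggest_spark_conf` function are hypothetical names invented for this example and are not part of Spark. The configuration keys it returns are real Spark properties, but the values are only a starting point, not a recommendation from the reviewer or from the Spark project.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ClusterShape:
    """Hypothetical description of a cluster; not a Spark type."""
    worker_nodes: int
    cores_per_node: int
    memory_gb_per_node: int


def suggest_spark_conf(cluster: ClusterShape, cores_per_executor: int = 5) -> Dict[str, str]:
    """Return starting-point Spark settings from a rule of thumb (assumption)."""
    # Leave one core and 1 GB per node for the OS and node daemons.
    usable_cores = cluster.cores_per_node - 1
    usable_memory_gb = cluster.memory_gb_per_node - 1
    if usable_cores < 1 or usable_memory_gb < 1 or cluster.worker_nodes < 1:
        raise ValueError("cluster is too small for this heuristic")

    cores = min(cores_per_executor, usable_cores)
    executors_per_node = max(1, usable_cores // cores)
    # Give up one executor slot cluster-wide for the driver.
    total_executors = max(1, executors_per_node * cluster.worker_nodes - 1)

    memory_per_executor_gb = usable_memory_gb / executors_per_node
    # Keep roughly 10% (at least 1 GB) aside as memory overhead.
    overhead_gb = max(1, round(memory_per_executor_gb * 0.10))
    heap_gb = max(1, int(memory_per_executor_gb) - overhead_gb)

    # Two shuffle partitions per available core is an assumed multiplier.
    shuffle_partitions = total_executors * cores * 2

    return {
        "spark.executor.instances": str(total_executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{heap_gb}g",
        "spark.executor.memoryOverhead": f"{overhead_gb}g",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


if __name__ == "__main__":
    shape = ClusterShape(worker_nodes=10, cores_per_node=16, memory_gb_per_node=64)
    for key, value in suggest_spark_conf(shape).items():
        print(f"--conf {key}={value}")
```

Each pair can be passed to `spark-submit` with `--conf key=value` or set on `SparkSession.builder` with `.config(key, value)`. The right values still depend on the workload, so a helper like this only narrows the search.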
Return on Investment
Increased data literacy and adherence to best data engineering practices across the organization
Increased ability for data analysts to access their data quickly and reliably, better supporting data-driven decisions
Decreased costs due to better parallelization of resources
Assistant Professor in Engineering at The National Institute of Engineering, Mysuru (501-1000 employees)
Use Cases and Deployment Scope
If you are doing analytics on large-scale data, don't go any further without Apache Spark! One of the projects in which I used Apache Spark was a recommendation systems project; recommendation systems are also my area of research expertise. Deploying a RecSys with Apache Spark is very easy thanks to its scalability, its flexibility in using various data sources, and its fault tolerance. The built-in machine learning library, MLlib, is a boon to work with. We didn't require any other libraries.
Pros
Fault tolerance: nodes rarely fail, and if one does, processing still continues.
Scalable to any extent.
Has a built-in machine learning library, MLlib.
Very flexible: data from various data sources can be used. Usage with HDFS is very easy.
Cons
It is not fully backward compatible.
It is memory-intensive for heavy workloads and large datasets.
Support for advanced analytics is not available; MLlib offers only minimal analytics.
Deployment is a complex task for beginners.
Most Important Features
Scalability
We had data across multiple sources, and integration with those data source types was not a problem
Recommendations were easy to generate
Return on Investment
We used Apache Spark for one of our research projects. The ROI cannot be measured here, but the research paper was accepted at a good conference. What else would a project require?!
Staff Engineer in Information Technology at Nagarro (10,001+ employees)
Use Cases and Deployment Scope
Earlier we were using an RDBMS such as Oracle for retail and eCommerce data. We faced challenges with cost, performance, and the huge volume of incoming transactions. After a lot of critical issues, we migrated to Delta Lake. Now we use Apache Spark Streaming to handle all real-time transactions. For batch data as well, we handle terabytes of data using Apache Spark.
Pros
Realtime data processing
Interactive Analysis of data
Trigger Event Detection
Cons
Machine Learning
GraphX Lib
True Realtime Streaming
Most Important Features
Fast Processing
In-Memory Computing
Provides better insights
Return on Investment
No investment as it is open source
Cheap commodity hardware can save a lot of money
Alternatives Considered
Apache Hadoop, SAP HANA Cloud and Apache Ignite
Other Software Used
SAP HANA Cloud, Apache Hive, Apache Airflow, Apache Kafka, Tableau Server, Tableau Desktop
Senior Software Developer (Consultant) in Information Technology at Morgan Stanley (10,001+ employees)
Use Cases and Deployment Scope
We need to calculate risk-weighted assets (RWA) daily and monthly, on a T+1 basis, for the different positions the bank holds. The volume of calculations is large: millions of records per day, with very complicated formulas and algorithms. In our applications and projects, we used Scala and Apache Spark clusters to load all the data needed for the calculations and implemented the formulas and algorithms with Spark's DataFrame and Dataset APIs.
Without adopting the Apache Spark cluster, it would be pretty hard for us to implement such a big system to handle a large volume of data calculations daily. After this system was successfully deployed into PROD, we've been able to provide capital risk control reports to regulation/compliance controllers in different regions in this global financial world.
Pros
DataFrame as a distributed collection of data: easy for developers to implement algorithms and formulas.
Calculation in-memory.
Clustering to distribute large calculation workloads.
Cons
It would be great if Apache Spark could provide a native database to manage the file metadata of saved Parquet files.
Most Important Features
The speed of processing a large volume of data.
DataFrame with SQL-like operations reduces the learning curve for new developers who already have good knowledge of databases and SQL.
Cluster to scale up/down easily.
Return on Investment
With the daily risk reports calculated via Apache Spark, the bank is able to comply with the FHC rule in the US and other regions and to control its capital with counterparties much better.