Making Sense of Big Data
Why Apache Spark is well suited for all kinds of ETL workloads.
This is part 2 of a series on data engineering in a big data environment. It reflects my personal journey of lessons learnt and culminates in Flowman, the open source tool I created to take away the burden of reimplementing the same boilerplate code over and over again across projects.
- Part 1: Big Data Engineering — Best Practices
- Part 2: Big Data Engineering — Apache Spark
- Part 3: Big Data Engineering — Declarative Data Flows
- Part 4: Big Data Engineering — Flowman up and running
This series is about building data pipelines with Apache Spark for batch processing. But some aspects are also valid for other frameworks or for stream processing. Eventually I will introduce Flowman, an Apache Spark based application that simplifies the implementation of data pipelines for batch processing.
This second part highlights why Apache Spark is so well suited as a framework for implementing data processing pipelines. There are many alternatives, especially in the domain of stream processing. But from my point of view, when working in a batch world (and there are good reasons to do that, especially if many non-trivial transformations are involved that require a larger amount of history, like grouped aggregations and huge joins), Apache Spark is an almost unrivaled framework.
This article tries to shed some light on the capabilities Spark offers that provide a solid foundation for batch processing.
I already commented in the first part on the typical parts of a data processing pipeline. Let’s just repeat those steps:
- Extraction. Read data from some source system, be it a shared filesystem like HDFS, an object store like S3 or a database like MySQL or MongoDB.
- Transformation. Apply some transformations like data extraction, filtering, joining or even aggregation.
- Loading. Store the results back again into some target system. Again this can be a shared filesystem, object store or some database.
We can now deduce some requirements of the framework or tool to be used for data engineering by mapping each of these steps to a desired capability — with some additional requirements added to the end.
- Broad range of connectors. We need a framework that is able to read in from a broad range of data sources like files in a distributed file system, records from a relational database or a column store or even a key value store.
- Broad and extensible range of transformations. In order to “apply transformations” the framework should clearly support and implement transformations. Typical transformations are simple column-wise transformations like string operations, filtering, joins, grouped aggregations — all the stuff that is offered by traditional SQL. On top of that the framework should offer a clean and simple API to extend the set of transformations, specifically the column-wise transformations. This is important for implementing custom logic that cannot be implemented with the core functionality.
- Broad range of connectors. Again we need a broad range of connectors for writing the results back into the desired target storage system.
- Extensibility. I already mentioned this in the second requirement above, but I feel this aspect is important enough for an explicit point. Extensibility may not only be limited to the kind of transformations, but it should also include extension points for new input/output formats and new connectors.
- Scalability. Whatever solution is chosen, it should be able to handle an ever growing amount of data. First, in many scenarios you should be prepared to handle more data than would fit into RAM, so that you do not get completely stuck as the data grows. Second, you might want to distribute the workload onto multiple machines if the amount of data slows down processing too much.
Apache Spark provides good solutions to all of the requirements above. Apache Spark itself is a collection of libraries, a framework for developing custom data processing pipelines. This means that Apache Spark is not a full-blown application; instead, it requires you to write programs which contain the transformation logic, while Spark takes care of executing that logic efficiently, distributed across multiple machines in a cluster.
Spark was started at UC Berkeley’s AMPLab in 2009 and open sourced in 2010. In 2013, the project was donated to the Apache Software Foundation. It soon gained traction, especially among people who had worked with Hadoop MapReduce before. Initially Spark offered its core API around so-called RDDs (Resilient Distributed Datasets), which provide a much higher level of abstraction than Hadoop MapReduce and thereby helped developers to work much more efficiently.
Later on, the newer and now preferred DataFrame API was added, which implements a relational algebra with an expressiveness comparable to SQL. This API provides concepts very similar to tables in a database, with named and strongly typed columns.
While Apache Spark itself is developed in Scala (a mixed functional and object oriented programming language running on the JVM), it provides APIs to write applications using Scala, Java, Python or R. When looking at the official examples, you quickly realize that the API is really expressive and simple.
- Connectors. Since Apache Spark is only a processing framework with no built-in persistence layer, it has always relied on connectivity to storage systems like HDFS, S3 or relational databases via JDBC. This implies that a clean connectivity design was built in from the beginning, specifically with the advent of DataFrames. Nowadays almost every storage or database technology simply needs to provide an adapter for Apache Spark to be considered a possible choice in many environments.
- Transformations. The original core library provides the RDD abstraction with many common transformations like filtering, joining and grouped aggregations. But nowadays the newer DataFrame API is to be preferred and provides a huge set of transformations mimicking SQL. This should be enough for most needs.
- Extensibility. New transformations can easily be implemented with so-called user defined functions (UDFs): you only need to provide a small snippet of code working on an individual record or column, and Spark wraps it up such that the function can be executed in parallel, distributed across a cluster of computers.
Since Spark has a very high code quality, you can even go down one or two layers and implement new functionality using the internal developer API. This might be a little more difficult, but it can be very rewarding for those rare cases which cannot be implemented using UDFs.
- Scalability. Spark was designed to be a Big Data tool from the very beginning, and as such it can scale to many hundreds of nodes within different types of clusters (Hadoop YARN, Mesos and lately Kubernetes, of course). It can process data much bigger than what would fit into RAM. One very nice aspect is that Spark applications can also run very efficiently on a single node without any cluster infrastructure. This is convenient for testing from a developer's point of view, but it also makes it possible to use Spark for not-so-huge amounts of data and still benefit from Spark's features and flexibility.
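To make the connector, transformation and UDF points a bit more tangible, here is a minimal PySpark sketch. It is only an illustration under assumptions: the function names `normalize_country` and `count_orders_by_country`, the `country` column, the JSON input and the Parquet output are all hypothetical and chosen for the example. The custom logic is kept in a plain Python function, and Spark is imported lazily inside the pipeline function so that the logic can be tested without a Spark installation.

```python
def normalize_country(code):
    """Plain Python logic for a single column value; it knows nothing about Spark."""
    if code is None:
        return None
    cleaned = code.strip().upper()
    # Map a few legacy spellings onto two-letter codes (illustrative mapping only).
    return {"UK": "GB", "GER": "DE"}.get(cleaned, cleaned)


def count_orders_by_country(spark, source_path, target_path):
    """Read JSON records, clean one column with a UDF, aggregate, write Parquet.

    `spark` is an existing SparkSession; both paths are hypothetical.
    """
    # Imported lazily so normalize_country can be unit tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the plain function so Spark can apply it to every record in parallel.
    normalize_udf = F.udf(normalize_country, StringType())

    orders = spark.read.json(source_path)  # connector: read
    result = (
        orders
        .withColumn("country", normalize_udf(F.col("country")))  # custom logic
        .groupBy("country")  # built-in, SQL-like transformation
        .count()
    )
    result.write.mode("overwrite").parquet(target_path)  # connector: write
```

For local experiments, a session created with `SparkSession.builder.master("local[*]").getOrCreate()` should be enough to call `count_orders_by_country` on a single machine without any cluster.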
With these four aspects, Apache Spark is very well suited to typical data transformation tasks formerly done with dedicated and expensive ETL software from vendors like Talend or Informatica. By using Spark instead, you get all the benefits of a vibrant open source community and the freedom to tailor applications precisely to your needs.
Although Spark was created with huge amounts of data in mind, I would always consider it even for smaller amounts of data simply because of its flexibility and the option to seamlessly grow with the amount of data.
Of course Apache Spark isn’t the only option for implementing data processing pipelines. Software vendors like Informatica and Talend also provide very solid products for people who prefer to buy into complete ecosystems (with all the pros and cons).
But even in the Big Data open source world, there are some projects which might seem to be alternatives at first glance.
First, we still have Hadoop around. But Hadoop actually consists of three components, which have been split up cleanly. There is the distributed file system HDFS, which is capable of storing really huge amounts of data (petabytes, that is). Next, there is the cluster scheduler YARN for running distributed applications. Finally, there is the MapReduce framework for developing a very specific type of distributed data processing application. While the first two components, HDFS and YARN, are still widely used and deployed (although they feel the pressure from cloud storage and Kubernetes as possible replacements), the MapReduce framework simply shouldn’t be used by any project any more. The programming model is much too complicated, and writing non-trivial transformations can become really hard. So, yes, HDFS and YARN are fine as infrastructure services (storage and compute), and Spark is well integrated with both.
Other alternatives could be SQL execution engines (without an integrated persistence layer) like Hive, Presto, Impala, etc. While these tools often also provide broad connectivity to different data sources, they are all limited to SQL. For one, SQL queries themselves can become quite tricky for long chains of transformations with many common table expressions (CTEs). Second, it is often more difficult to extend SQL with new features. I wouldn’t say that Spark is better than these tools in general, but I do say that Spark is better for data processing pipelines. These tools really shine for querying existing data, but I would not want to use them for creating data, since that was never their primary scope. On the other hand, while you can use Spark via the Spark Thrift Server to execute SQL for serving data, it wasn’t really created for that scenario.
One question I often hear is which programming language should be used for accessing the power of Spark. As I wrote above, Spark provides bindings for Scala, Java, Python and R out of the box, so the question really makes sense.
My advice is to use either Scala or Python (maybe R; I don’t have experience with that), depending on the task. Never use Java, since it really feels much more complicated than the clean Scala API. Invest some time to learn some basic Scala instead.
Now that leaves us with the question “Python or Scala”.
- If you are doing data engineering (read, transform, store), then I strongly advise using Scala. First, since Scala is a statically typed language, it is actually simpler to write correct programs than with Python. Second, whenever you need to implement new functionality not found in Spark, you are better off with the native language of Spark. Although Spark supports UDFs in Python well, you will pay a performance penalty and you cannot dive any deeper. Implementing new connectors or file formats with Python will be very difficult, maybe even impossible.
- If you are doing data science (which is not in the scope of this article series), then Python is the much better option, with all those Python packages like Pandas, SciPy, scikit-learn, TensorFlow etc.
Apart from the different libraries in those two scenarios, the typical development workflow is also very different: the applications developed by data engineers often run in production every day or even every hour. Data scientists, on the other hand, often work interactively with data, and some insight is the final deliverable. So production readiness is much more critical for data engineers than for data scientists. And even though many people will disagree, “production readiness” is much harder with Python or any other dynamically typed language.
Now since Apache Spark is such a nice framework for complex data transformations, we can simply start implementing our pipelines. Within a few lines of code, we can instruct Spark to perform all the magic to process our multi-terabyte data set into something more accessible.
Wait, not so fast! I did that multiple times in the past for different companies, and after some time I found out that many aspects have to be implemented over and over again. While Spark excels at data processing itself, I argued in the first part of this series that robust data engineering is about more than the processing itself. Logging, monitoring, scheduling and schema management all come to mind, and all these aspects need to be addressed for every serious project.
Those non-functional aspects often require non-trivial code to be written, some of which can be very low level and technical. Therefore Spark alone is not enough to implement a production quality data pipeline. Since those issues arise independently of the specific project and company, I propose to split up the application into two layers: a top layer containing the business logic encoded in data transformations and the specifications of the data source and data target, and a lower layer that takes care of executing the whole data flow, providing relevant logging and monitoring metrics, and handling schema management.
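The two-layer split can be sketched in a few lines of plain Python. This is deliberately not Spark code and not how Flowman is implemented; every name in it (`SPEC`, `execute`, the step functions) is hypothetical, and ordinary lists of dictionaries stand in for DataFrames. The point is only to show the shape: the top layer describes what should happen, and a generic lower layer runs it and adds logging and timing in one place.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


# --- Top layer: business logic and source specification only ---------------
def read_orders():
    # Stand-in for a real source; a Spark job would read from storage here.
    return [
        {"country": "DE", "amount": 10},
        {"country": "FR", "amount": 5},
        {"country": "DE", "amount": 7},
    ]


def keep_large_orders(rows):
    return [row for row in rows if row["amount"] >= 7]


def total_per_country(rows):
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0) + row["amount"]
    return [{"country": c, "total": t} for c, t in sorted(totals.items())]


SPEC = {
    "name": "orders_by_country",
    "source": read_orders,
    "steps": [keep_large_orders, total_per_country],
}


# --- Lower layer: generic execution with logging and simple metrics --------
def execute(spec, sink):
    """Run the flow described by spec and hand the final records to sink."""
    started = time.perf_counter()
    log.info("starting flow %s", spec["name"])
    rows = spec["source"]()
    log.info("source produced %d records", len(rows))
    for step in spec["steps"]:
        rows = step(rows)
        log.info("step %s produced %d records", step.__name__, len(rows))
    sink(rows)
    log.info("flow %s finished in %.3fs", spec["name"], time.perf_counter() - started)
    return rows


written = []                    # stand-in for a real target system
execute(SPEC, written.extend)
print(written)                  # [{'country': 'DE', 'total': 17}]
```

Because `execute` knows nothing about orders or countries, the same lower layer could be reused for any other specification, which is the main reason for separating the two layers.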
This was the second part of a series about building robust data pipelines with Apache Spark. We had a strong focus on why Apache Spark is very well suited for replacing traditional ETL tools. Next time I will discuss why another layer of abstraction will help you to focus on business logic instead of technical details.