Things you should know when traveling via the Big Data Engineering hype-train

Maybe you want to join the Big Data world? Or maybe you are already there and want to validate your knowledge? Or maybe you just want to know what Big Data Engineers do and what skills they use? If so, you may find the following article quite useful.

Why?

Half a year ago I left the Big Data world after three years of creating various data pipelines. Now I would like to share the knowledge I accumulated in that period before it becomes completely obsolete. This article points out the things I would like you to know if you were to become my coworker. These are also the things I used to ask candidates for a Big Data Engineer position at my previous company.

I will try not only to point out relevant questions but also to give brief answers to some of them. Keep in mind that these answers are far from comprehensive; to really understand a topic you need to research it on your own.

Don’t be afraid if you don’t know the answers to some of these questions. A complete understanding of these topics is a requirement only for a senior developer; for mid or junior positions a subset is enough.

Analysis vs Science vs Engineering

It’s a very popular question: what is the difference between data science, data analysis and data engineering? It’s also quite difficult to answer. Here is my understanding:

  • Data Analyst - a person who looks at the data and tries to answer business questions based on it. He typically works with business intelligence tools or just MS Excel.
  • Data Scientist - a person who plays with raw data: joins various data sources, does basic data processing, feeds data into machine learning models, tunes these models and, in general, tries to get as much value from raw data as possible. He typically uses tools such as Python, R, Matlab, TensorFlow and similar.
  • Data Engineer - a software developer specialized in data transformation. His job is to take the data and make it useful to the business, in a way that stays maintainable in the long term. This broad description covers fetching data from various sources (ELK, Splunk, Hadoop, Kafka, REST APIs, object storage and everything else), transforming it and serving it to fulfil business needs. He often works closely with data scientists to determine what the data pipeline should look like, which ML model to use and so on.

For a more detailed comparison, as well as an introduction to the “Machine Learning Engineer” role, you can check out this awesome article.

The following paragraphs describe the knowledge needed by data engineers, but it’s also relevant to data scientists, as they use a similar toolset in a different way.

Programming language

To efficiently develop data pipelines you need to be proficient with at least one of the following languages: Scala, Java, Python.

The choice really depends on your previous experience and on the preferences of your workplace. If you’re wondering which one to learn right now, I would go with Scala. It’s the main language for Spark development, has the biggest number of examples and is the most expressive. Because of its functional and strongly-typed nature, it’s also the safest of them.

Regardless of your chosen language, you have to have a basic understanding of the JVM, as it is the most popular platform for data processing. You will need to be able to diagnose and debug problems occurring on the JVM. Here are a couple of things to know (a small diagnostics sketch follows the list):

  • How would you handle OutOfMemoryErrors being thrown?
  • What is a garbage collector? What garbage collectors are available and what are the differences between them?
  • How to use a profiler to find performance bottlenecks?
  • How to connect debugger to a remote JVM?
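
To give a feel for this, here is a minimal Scala sketch using the standard `java.lang.management` API that prints heap usage and the active garbage collectors. The JVM flags mentioned in the comments are the usual starting points for memory diagnostics and remote debugging; treat the whole thing as a sketch, not a complete diagnostics toolkit.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Minimal JVM introspection: heap usage and garbage collector statistics.
// A useful first step before reaching for a full profiler.
object JvmDiagnostics extends App {
  val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
  println(s"Heap used: ${heap.getUsed / (1024 * 1024)} MB of ${heap.getMax / (1024 * 1024)} MB")

  ManagementFactory.getGarbageCollectorMXBeans.asScala.foreach { gc =>
    println(s"GC: ${gc.getName}, collections: ${gc.getCollectionCount}, time: ${gc.getCollectionTime} ms")
  }

  // Handy HotSpot flags when diagnosing memory problems:
  //   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp
  //     -> dump the heap on OOM for offline analysis
  //   -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
  //     -> expose a port so you can attach a remote debugger
}
```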

Hadoop ecosystem

Apache Hadoop is the most basic toolset for a Big Data Engineer. You don’t need to know all the caveats of each of the tools, but it is important to have a basic understanding of how they work, so you can communicate effectively and know what to study when needed.

  • What is Apache Hadoop? - It’s a set of a few different tools, not a single thing you can use. The most popular parts of this ecosystem are HDFS, YARN, Hive, HBase and Oozie.
  • What is HDFS? - It’s a distributed filesystem used to store data across the cluster.
    • What is a namenode and a datanode?
    • What is a replication factor?
    • What is a block size? How does it affect performance and app development?
    • What security mechanisms can we use to restrict data access on the cluster? What is Kerberos?
  • What is YARN? - It’s a resource manager responsible for assigning memory and cores to computing jobs.
    • What are schedulers and what schedulers are available by default?
    • What are queues?
  • What is Hive? - It’s a SQL layer on top of HDFS.
    • What is the connection between Hive, HiveServer2, the metastore and Beeline?
    • What is the difference between Hive, Pig and Impala?
    • What are UDFs?
  • What is HBase? - It’s a NoSQL database built on top of HDFS.
    • How does it compare to Cassandra?
    • How does it fit into the CAP theorem?
  • What is Oozie? - It’s a workflow manager used to create data pipelines.
    • What are alternatives to it?
  • What serialization formats do you know? - The most popular ones are Avro, ORC, Parquet, JSON and Hadoop object and text files.
    • What is a column-based format? How does it differ from a row-based format? (see the sketch after this list)
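
As a quick illustration of the column-based vs row-based question, here is a minimal Spark sketch in Scala that writes a DataFrame as Parquet and reads back only the columns a query needs. The local master and the `/tmp/events.parquet` path are just placeholders for the example.

```scala
import org.apache.spark.sql.SparkSession

// Parquet is column-based: a reader can skip the bytes of untouched columns
// entirely, while a row-based format (e.g. CSV or Avro) has to scan whole records.
object ColumnarFormats extends App {
  val spark = SparkSession.builder().appName("columnar-demo").master("local[*]").getOrCreate()
  import spark.implicits._

  val events = Seq(("user1", "click", 12L), ("user2", "view", 7L))
    .toDF("user", "action", "durationMs")

  events.write.mode("overwrite").parquet("/tmp/events.parquet") // example path

  // Only the "user" and "durationMs" columns are actually read from disk.
  spark.read.parquet("/tmp/events.parquet")
    .select("user", "durationMs")
    .show()

  spark.stop()
}
```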

Apache Spark

Apache Spark has become the go-to tool for data processing. It’s good to have some experience with it, but it’s not critical. For me, it’s just another tool that exposes an API you will have to learn. An experienced developer has learned hundreds of APIs and should not have a problem learning yet another one. Still, there are a few things worth knowing about it, even without real experience.

  • What are actions and transformations?
  • What deployment modes are available?
  • How are serialization and deserialization handled when crossing JVM boundaries? How does it differ between the RDD, DataFrame and Dataset APIs?
  • What are shuffles? When do they occur and how do they affect performance? (see the sketch after this list)
  • How do you do stateful processing, both in batch and in streaming?
  • What are Spark alternatives? - It’s probably the most important question. You should have a basic understanding of MapReduce, Flink, Kafka Streams and Cascading/Scalding, as well as non-clustered data processing tools like bash or some simple libraries (e.g. fs2 or Akka Streams).
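
To make the actions/transformations and shuffle questions concrete, here is a minimal Spark sketch in Scala: transformations are lazy, `reduceByKey` marks a shuffle boundary and nothing runs until an action is called. The local master is, again, just for the example.

```scala
import org.apache.spark.sql.SparkSession

// Transformations (map, reduceByKey) only build up a lineage graph; Spark
// schedules and runs a job only when an action (collect, count, ...) is invoked.
// reduceByKey forces a shuffle, because records with the same key must end up
// on the same executor before they can be combined.
object ActionsVsTransformations extends App {
  val spark = SparkSession.builder().appName("spark-basics").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))

  val counts = words
    .map(w => (w, 1))   // transformation: lazy, no job started yet
    .reduceByKey(_ + _) // transformation: lazy, but introduces a shuffle boundary

  // Only now does Spark actually run the job.
  counts.collect().foreach(println)

  spark.stop()
}
```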

Non-Hadoop tools

Sometimes it’s even more important to know the tools not directly connected to Big Data. One of the most critical skills of a data engineer is knowing when NOT to use Big Data and to go with much simpler tooling.

  • What is Apache Kafka?
    • Kafka has become the go-to data streaming solution; you will run into it sooner or later (a minimal producer sketch follows this list).
    • What is a topic?
    • What is an offset?
  • What is Redis?
    • It’s good to know at least one in-memory DB. Once you have created your final dataset, you need to know how to use it, and in many cases you will need to serve it fast.
  • What is the ELK stack?
    • Sooner or later you will need a search engine or a dashboarding solution. Elasticsearch and Kibana are a good start.
  • What is your experience with bash?
    • In many cases you will need tooling that lets you play with your dataset in the most flexible and least time-consuming way. You should be fluent with tools such as grep, cat, jq or awk, which allow you to filter, modify and aggregate your data without any overhead. They are not suitable for production usage but are perfect for experimentation and finding answers quickly.
    • What good practices do you know when creating bash scripts?
    • This area of expertise can be boosted with knowledge of other scripting technologies like Python, Ruby or Ammonite.
  • What are probabilistic data structures?
    • When the volume of data exceeds a certain amount, it’s no longer feasible to make calculations with exact precision. In such cases you should know some basic probabilistic data structures, such as a Bloom filter or HyperLogLog, that estimate things such as cardinality or set membership with a controlled error rate (a toy Bloom filter sketch follows this list).
  • What SQL databases do you know? What are their limitations?
    • It’s good to be familiar with at least one good old SQL database. I can bet at least 30% of Big Data projects could just use Postgres and throw all their Spark/Hadoop code away.
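
For the Kafka item above, here is a minimal producer sketch in Scala using the official `kafka-clients` Java API. The broker address and the topic name are placeholders; real setups would also configure retries, acks and so on.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Minimal Kafka producer: every record goes to a named topic, and each consumer
// tracks its own offset within that topic's partitions.
object KafkaSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord[String, String]("events", "user1", """{"action":"click"}""")) // example topic
  producer.close()
}
```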
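
And for the probabilistic data structures item, here is a toy Bloom filter in plain Scala. It only sketches the idea; real pipelines would typically reach for a library implementation (e.g. the ones bundled with Spark or Guava).

```scala
import scala.util.hashing.MurmurHash3

// A toy Bloom filter: constant memory, no false negatives, a tunable rate of
// false positives. Each item sets a handful of bits; membership checks see
// whether all of those bits are set.
class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      val h = MurmurHash3.stringHash(item, seed) % numBits
      if (h < 0) h + numBits else h
    }

  def add(item: String): Unit = positions(item).foreach(i => bits.set(i))

  // "false" means definitely absent; "true" means probably present.
  def mightContain(item: String): Boolean = positions(item).forall(i => bits.get(i))
}

object BloomFilterDemo extends App {
  val seenUsers = new BloomFilter(numBits = 1 << 20, numHashes = 5)
  seenUsers.add("user42")
  println(seenUsers.mightContain("user42"))   // true
  println(seenUsers.mightContain("user1337")) // almost certainly false
}
```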

Summary

This list is not comprehensive, but it’s most probably not possible to create a comprehensive one. The exact requirements of your potential employer may differ from this list significantly, but being able to answer these questions will give you a good base for future development.

If you would like to complete this article with answers to these questions or add more topics (inline or via links), feel free to open a PR through the GitHub repo.
