How To Pick The Right Career In The Data World
Data Scientist, Data Analyst, or Data Engineer? How do you know which one is right for you?
It’s no secret that Data Engineer, Data Scientist, and Data Analyst are popular roles that a lot of people are eyeing and hoping to get into. PwC describes all three as “among the most sought-after positions in America” while Data Scientist and Data Engineer rank #3 and #8, respectively, among the top 15 emerging jobs on LinkedIn’s 2020 Emerging Jobs Report.
These roles are very closely related to each other; in fact, a lot of companies might even use the terms interchangeably. So if you are thinking about breaking into the data world, choosing the right role might seem like a daunting task that comes with a lot of questions: Can I become a Data Scientist without a Ph.D.? Do I need to know Python or R if I want to be a Data Analyst? As a person who used to work as a Data Scientist, currently working as a Data Analyst, and has worked closely with a lot of Data Engineers in both jobs, I will try to break down the differences for you and point you to the right resources for each.
Overview — High-level Differences and Overlap
If we roughly separate companies into two sides, the engineering side and the business side, we can use the Venn diagram below to illustrate how each role relates to and overlaps with either side of the company. Note that this distinction applies to most mid-sized and large companies; in small startups, the lines between these roles get blurry, and a single role is very often a blend of all three.
Data Engineer is the closest to the typical engineer role among the three, and the furthest from the business side. Data engineers spend most of their time designing, structuring, building, and maintaining databases. Most companies have data coming from a lot of different sources, internal as well as external; it’s the data engineers’ job to build and maintain data warehouses to make the data easily accessible and usable to the rest of the company. How can different data tables join onto each other? What should be the primary key for each table? These are some of the example decisions that Data Engineers make on the job. Data Engineers occasionally collaborate with the business side of the company to define the structure of tables since business teams are often the end-users of a lot of the tables Data Engineers build.
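To make the primary-key and join questions concrete, here is a minimal sketch in Python using the built-in sqlite3 module. The table names (users, orders), their columns, and the sample rows are all hypothetical and invented for illustration; a real warehouse would be much larger and would likely run on one of the platforms mentioned later in this article.

```python
import sqlite3

# In-memory database; every table and column name here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Each table gets a primary key that uniquely identifies a row.
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),
        amount   REAL NOT NULL
    )
    """
)

conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "US"), (2, "DE")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)],
)

# orders joins onto users through the shared user_id key.
rows = conn.execute(
    """
    SELECT u.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.country
    ORDER BY u.country
    """
).fetchall()
print(rows)
```

Choosing user_id as the key that links the two tables is what lets an analyst later answer a question like "revenue by country" with a single join.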
Data Scientist is probably the most well-known and most mentioned job title among the three. A common misconception about this role is that you need a Ph.D. in Machine Learning or a similar field. This is true only for a subgroup of Data Scientists, the ones who focus on modeling and algorithms. These Data Scientists (roughly 30% of all Data Scientists in a company) usually come from very strong, highly quantitative academic backgrounds and have extensive theoretical knowledge and practical experience in advanced ML topics. However, the majority (roughly 70%) of Data Scientists come from more diverse backgrounds. They spend most of their time carrying out AB tests and analytics related to different business metrics; the models they build are more likely to be for demand forecasting or ad hoc analysis than for reinforcement learning or deep neural nets. This article refers to the latter group when mentioning Data Scientists.
Data Analyst is used interchangeably with Data Scientist in a lot of companies, as both groups work closely with metrics and ad hoc analyses. If a distinction has to be made, it’s probably that Data Analysts work more on business interpretation and visualization of the metrics, while Data Scientists spend a lot of time doing statistical analyses of them.
Overlaps are common among the three roles. Everyone who has worked on Data Science projects knows that usually ~80% of the time is spent on data cleaning and the regression or classification in the end only takes ~20%, if not less. That’s why it’s so important for data scientists and data analysts to work closely with data engineers; they can save anyone from bad data by structuring and cleaning the data upfront before it goes into tables.
To better illustrate how these three roles work together, imagine a company that wants to roll out an AB test for a new feature on their app. Data Scientists will lead the effort of sizing the experiment and deciding how to split users into control and test groups. Data Engineers will set up the database in the background to make sure that when the AB test is launched, user activity and events are recorded and the data flows through to the database in the proper format and structure. After the experiment, Data Scientists and Data Analysts will perform statistical analyses on the results of the AB test, drill down on the metrics they care about, and build visualizations for reporting purposes.
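As a rough illustration of the "sizing" and "statistical analysis" steps in that example, here is a minimal Python sketch using only the standard library. It assumes the metric is a conversion rate and uses the textbook normal-approximation formulas for comparing two proportions; the function names, default thresholds, and sample numbers are illustrative assumptions, not a description of how any particular company runs its experiments.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift `mde`
    over a baseline conversion rate `p_base` (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_test = p_base + mde
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion
    rates between control (a) and test (b), using a pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Before the experiment: how many users per group for a 10% -> 12% lift?
n = sample_size_per_group(p_base=0.10, mde=0.02)

# After the experiment: made-up counts of conversions and users per group.
z, p = two_proportion_z_test(conv_a=500, n_a=5000, conv_b=600, n_b=5000)
print(n, round(z, 2), round(p, 4))
```

In practice, the harder parts are usually the ones the formulas hide, such as making sure the control and test groups are actually comparable.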
Technical requirements
Some degree of coding skill is a must for all three, but exactly which programming languages and analytics platforms are a must for each?
Data Engineers are experts in different data warehouses and cloud computing platforms, as well as how to build Extract/Transform/Load (ETL) data pipelines. They work with AWS, Google Cloud, Snowflake, and many other tools for their day-to-day work. Data Engineers are familiar with SQL and Python, and some are good at C++ and Java.
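For readers who have never seen an ETL pipeline, the following is a toy sketch of the three steps in Python, using only the standard library. The raw CSV text, the events table, and the cleaning rules are all hypothetical; production pipelines typically run on the cloud platforms named above and handle scheduling, retries, and far larger volumes.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row has a missing user_id, one has messy text.
RAW_CSV = """user_id,event,ts
1,click,2020-01-01T10:00:00
2,purchase,2020-01-01T10:05:00
,click,2020-01-01T10:06:00
3,CLICK ,2020-01-01T10:07:00
"""


def extract(text):
    """Extract: read raw rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a user_id and normalize event names."""
    clean = []
    for row in rows:
        if not row["user_id"]:
            continue
        clean.append((int(row["user_id"]), row["event"].strip().lower(), row["ts"]))
    return clean


def load(conn, records):
    """Load: write the cleaned records into a structured table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(user_id INTEGER NOT NULL, event TEXT NOT NULL, ts TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
counts = conn.execute(
    "SELECT event, COUNT(*) FROM events GROUP BY event ORDER BY event"
).fetchall()
print(counts)
```

Because the cleaning happens before the data lands in the table, downstream analysts can count events without first having to discover the missing IDs and inconsistent casing themselves.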
Data Scientists possess deep statistical knowledge and are no strangers to SQL, R, and Python. A good Data Scientist also knows, at a high level, some basic machine learning algorithms in theory as well as how to apply them.
Data Analysts are pros in SQL and have a practical level of statistical knowledge. They know how to quickly translate business questions into analytical ones and utilize tools like Tableau and Looker to build good visualizations.
Other important skills
Knowing how to Google. Seriously, know what and how to Google. You will inevitably get stuck, and when you do, Google and StackOverflow are your friends.
Learning on the job. This is somewhat tied to the last point. A lot of people learn on the job by Googling or talking to colleagues across the company. Every company has different databases and tools, different data cultures (not always perfect ones), workflows, and best practices; so being open and able to constantly learn on the job is essential for anyone in the data organization of companies.
Stakeholder management. All analytical efforts will eventually be used to drive business decisions, so explaining analytical results and concepts to business stakeholders and tying them back to business outcomes is an important part of any data professional’s job. Good data professionals are the ones who have enough analytics knowledge and, at the same time, possess business acumen.
So, how do you know what role to go after?
To answer this, there are two separate factors at play: 1. What do you want to do? and 2. What roles do you qualify for based on your current skill set and experience?
What do you want to do?
The three roles discussed in this article have varying degrees of exposure to the business side, which means they require different levels of stakeholder management. Put less abstractly, more business exposure means more human interaction and more Zoom meetings, which a lot of us analytical introverts dread. But on the flip side, it also means more tangible, visible impact and more exposure to decision-makers.
Maybe this thought experiment will help. Think back to our AB testing example: would you be most satisfied carrying out thorough statistical analyses to account for bias in the test and control groups, or building models to keep network effects from polluting the AB testing results (Data Scientists)? Or spending weeks coding and debugging data pipelines, and finally getting to watch the data flow into a neatly structured database like magic (Data Engineers)? Or closely monitoring the metrics you helped the business define, and knowing your visualizations helped drive the decision when you hear your friends talking about the new feature they enjoy so much in the app (Data Analysts)?
Which role do you qualify for?
Or, which role do you have time to build up the skills for? Setting aside factors you can’t change or make up for, like years of experience (and thankfully most data-related roles don’t have a strict requirement for background or major in school), most of the gaps between the job description and your resume can be closed with online classes and interview prep (I will be writing a separate post about how to prepare for interviews for those roles soon, so stay tuned!). Several weeks of online classes in SQL and basic R/Python, plus practice with McKinsey-style case studies, will get you a foot in the door of Data Analyst roles by helping you pass the technical screening and business acumen portions of the interview. But if you want to be equipped with enough programming and ETL knowledge for Data Engineer roles, or if you want to become an expert in statistics and modeling for more advanced Data Scientist roles, it could take you months or years.
But the good news is, most companies make transferring between different data roles extremely easy; and due to how transferable the data skill set is, you can almost never pigeonhole yourself into the wrong career path. So… if you really have no idea which role you want, start with any data-related role, try it out, and pivot like an early-stage startup.