The three most popular careers in data are data science, machine learning, and data engineering. But what exactly do these jobs entail? Jonathan Heyne, VP and General Manager of Data and Engineering Programs at Springboard, joins us to explain how these three jobs work together, the different tools that machine learning engineers, data scientists, and data engineers use on the job, the salaries you can expect in each role, and the education you need in order to get into these careers.
As the VP and GM of Data Programs at Springboard, Jonathan oversees all the aspects of preparing students for a career in data.
Jonathan spends a lot of time understanding the market, where the opportunities lie, how to teach to meet those opportunities, and how to support students along the way.
Data engineering, data science, machine learning engineering, and data analytics all deal with data and some level of programming. They also all require strong analytical thinking and hypothesis-driven thinking skills. This is true whether you’re analysing data, drawing an insight, figuring out the right approach to scale, or building the infrastructure to meet these performance constraints that a system needs.
Let’s say a bank wants to detect credit card fraud.
The data scientist develops a model that theoretically can detect credit card transaction fraud at a bank.
The machine learning engineer would then be responsible for deploying the model in real life and making sure it can handle billions of transactions daily.
The data engineer would be responsible for ensuring that all the transaction data that the bank handles is stored properly. If the system needs to process a million transactions a second, the data engineer would build the data pipelines that can convey all that info at the right time to the right parts of the system without any delays or bottlenecks.
Data Science combines the fields of statistics, machine learning, and programming, with some domain expertise in the operating space. Data scientists use this combination to build models that help companies and organizations draw insights and make predictions from their data. Data scientists use insights they’ve generated to either help with business problems or to develop new modeling techniques to tackle data problems.
Examples of a data scientist’s projects:
A data scientist may be asked to find out which customers are likely to churn vs renew their subscription.
A data scientist introducing new modeling techniques would develop an algorithm that makes facial recognition and tagging in social networks more accurate.
Tools + languages that data scientists use:
We teach Python in our Springboard Data Science Career Track because it is the prominent language in the industry. Many data scientists use data science libraries like pandas and scikit-learn and jupyter notebooks. R is used more for data exploration and modeling.
Machine Learning Engineering (MLE) is the art and science of deploying and managing machine learning models in production. A machine learning engineer takes models (statistical or machine learning) developed by data scientists and turns them into a live production system.
Machine learning engineers are basically software engineers with two additional skill sets:
Enough practical knowledge of machine learning to understand the model, what it takes to scale and deploy this model
Proficiency in specific engineering tools for managing data pipelines and deploying machine learning systems; to be able to monitor and debug them
A general backend engineer doesn't have to know these tools.
Data engineers set up the infrastructure on which the data scientists and machine learning engineers do their work. They are responsible for data storage, data transportation, at the right volume, at the right velocity, for the required usage. Data engineers are primarily software engineers that specialize in data pipelines and ensuring that data flows where, when, and how it's needed for these models to actually work. They don't need to understand the machine learning or statistical models the way data scientists do.
Data scientist creates model prototype
Machine learning engineer uses tools to scale and deploy those into production
Data engineer ensures that the system has what it needs to deliver deployment
Tools + languages that data engineers use:
Statistics as a mathematical concept has been around for thousands of years, but businesses started using statistical analysis for insight in the early 20th century. The modern meaning of data science - the combination of using data, statistics, programming, and business skills - was coined around 2008.
Machine learning engineering became a role in the last five to seven years due to the availability of these large scale machine learning infrastructure systems. This stems from the rapid development and availability of hardware infrastructure for storing and processing massive amounts of data. This meant finally being able to scale those machine learning models into real production.
How do these roles work together in a team setting?
No differently from any cross-functional project that requires teams to work together! These teams feed one another; therefore, it is imperative that data engineers, data scientists, machine learning engineers, data analysts, and the business stakeholders are all in agreement about the project requirements and the constraints as early in the project as possible. Organizations need to determine boundaries between these roles in a way that works for everyone, so there can be clarity about responsibilities.
Tensions arise because of lack of clarity of requirements. Teams need to know:
How accurate the model is/needs to be
What data is needed to enable the ML model to make the predictions
How often the model needs to be retrained
How to test whether the model is working correctly
Types of decisions and alignments that need to happen as early as possible to allow everyone to work together with fewer friction points
Jobs/Roles in Data
The senior roles in these fields are more advanced applications of similar concepts, working with more complicated models and systems. Examples include: machine learning scientists who work on self-driving cars applications; or data architects who work on advanced applications and design systems from scratch. These are roles we typically can’t prepare bootcampers for as beginners, but are roles that their careers could advance to with experience.
How do salaries compare?
All roles are well-compensated career paths with plenty of room for career growth and development. The answer heavily depends on company size, industry, geography, and seniority.
They're all very capable of getting to the national average of a six-figure salary. Machine learning engineers would typically be making a bit more, given the requirements for significant software engineering experience, and would be more senior roles.
Which role is easiest? Which requires the highest degree?
Data engineers and machine learning engineers are more specialized backend developer roles.
Ideally, they need some software engineering experience first. But some data engineers that work with hard and novel computer engineering problems may actually need a computer science background. However, these skills can be learned through experience. There are many examples of data analysts who became successful data engineers without being a software engineer first.
If you're already a software engineer and more interested in diving into the engineering problem of the data world, then data engineering may be your path. If, as an engineer, you’re more curious about the machine learning and the statistics side, you could consider the data science or machine learning engineering role. Some machine learning engineering roles require advanced degrees, but data engineers typically don’t. Data scientists come from a variety of backgrounds.
What does the roadmap into data look like for a complete beginner at Springboard?
We designed the Springboard offering as a ‘school of data’ that can help anyone transition into a career in data. We offer job-guaranteed career tracks in data science, machine learning engineering, data engineering, and data analytics. Regardless of experience or background – whether someone has two years of experience in software engineering, or never written a line of code – there is a course at Springboard that can get them into a data role.
Post-graduation, I’d recommend targeting an entry-level data science role or data analytics role, then begin exploring the world of data and carve their path from there. We know that over 90% of our students are making $26,000 more in their new data jobs than they did before our program.
When preparing students for a career in data, what is the bootcamp's responsibility to teach ethical use of data and of the tools learned?
In our ‘data-driven’ world, there’s a tendency to trust what the data says without being aware of how data models can show the same data in different ways based on how they're built. The potential for abusing data or data applications is massive and the responsibility to stop that rests with everyone in the space. Every educational institution, not just bootcamps, teaching any kind of data-related skill needs to talk about ethics, privacy, and security in these fields. Bootcamps specifically need to incorporate ethical thinking into what we teach and in our projects.
We take ethics in AI and ML very seriously at Springboard, in curriculum and the way capstone projects are designed, assessed, and implemented across all our data courses.
We really get into ethics once students start building projects. For example, a student wants to build a model that predicts crime. The project guidelines and the content in the course should encourage the students to ask themselves questions like:
What data will I use?
Is the data I'm going to use already biased in any way? How will I account for that bias?
What is this model that I'm creating actually learning? Is it basing its predictions on useful information? Or is it replicating, perpetuating, and amplifying existing social biases?
How can someone accidentally or intentionally abuse my model? If the model is abused, what will be the consequences of this? From a business perspective, what will my organization be liable for? What if we build models that are later abused by others?
What are your favorite resources for complete beginners who want to dive into data?
Reddit has a lot of good information and Coursera has a few nice intro to data science courses. I also like two books:
Doing Data Science by Cathy O’Neil is a summary of real life data science case studies that's written in an accessible way.
Designing Data-Intensive Applications by Martin Kleppmann is a good overview of core engineering principles and practices that go into creating reliable and scalable data systems.
For anyone considering a career transition into data, or any other space for that matter, I’d suggest you first consider what level of support and accountability you need in your own process. Can you realistically learn completely on your own or do you need some structure around your learning? If you look at bootcamps like Springboard and others, find which one provides you with a level of rigor, accountability, support and proven outcomes that you’ll need to be successful.
A beginner's guide on how to use Python in cybersecurity with Flatiron School!
Everything you need to know about jQuery!
11 Java jobs that bootcamp graduates can land!