7 traits of the best Data Scientists

Over the course of 10 years of working with data you tend to pick up a few things. You learn (and re-learn) languages and syntax. You make mistakes, some of them bad ones. You get better at spotting bullshit. You also start to observe some patterns, both in the work and in your colleagues.

Seven is an arbitrary number of course but I believe the following seven traits are things I have consistently observed across the best Data Scientists I have had the pleasure of working with. They are not things you can put on a resume. They probably won’t help you in that coding test or job interview. But they are traits which mean they are able to make more meaningful contributions, have a bigger impact, and get along better with their colleagues.

So, given this article is already long enough, let’s skip to the part we are all here for: 7 traits of the best Data Scientists.

1. They look at the data

A tendency I have noted in some Data Scientists is that as they become more confident working with Python and/or R, there is an increasing reliance on summary statistics and charts to understand data. Sometimes this starts to happen at the expense of looking at the data itself.

Don’t get me wrong, summary statistics and charts are, of course, important for understanding data. Knowing the distribution of the data, the mean, variance, quartiles (or whatever quantiles you prefer), as well as the skewness and outliers is vital. Additionally, it is important to understand how various features relate to each other via measures of correlation and/or charts such as pairwise plots. However, something that the best Data Scientists know is that if they are really going to understand a dataset, they are going to have to look at the data.

If you have all the summary stats and charts, why do you need to look at the actual data? There are a couple of good reasons.

The first reason is that looking at the data forces you to think about what that data is representing. It leads you to ask questions about the data generating process, e.g. how did this data come to exist? How was it collected? How was it stored? Do these numbers make sense in that context? These could be questions like:

Are the values in the dob column actually dates, hopefully in the past?
Why are missing values being stored as null, “null”, and “”?
Can someone really be active on an app for 20,000 seconds in a day, and how are we measuring that anyway?
Why don’t these IDs all have the same number of characters?
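Questions like these can be turned into quick checks that point you at the rows worth reading. What follows is a minimal sketch in plain Python, not a definitive recipe: the rows, the column names (`id`, `dob`, `active_seconds`) and the four-hour threshold are all invented for illustration.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical rows standing in for a real extract; every value is made up.
ROWS = [
    {"id": "A1001", "dob": "1988-04-12", "active_seconds": "5400"},
    {"id": "A1002", "dob": "2091-01-01", "active_seconds": "20000"},
    {"id": "A103", "dob": "null", "active_seconds": ""},
    {"id": "A1004", "dob": "12/04/1988", "active_seconds": None},
]

MISSING = (None, "", "null")       # the three encodings from the list above
SUSPICIOUS_SECONDS = 4 * 60 * 60   # assumed threshold for "worth a second look"


def parse_iso_date(value):
    """Return a date for a YYYY-MM-DD string, or None if it does not parse."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def find_issues(rows, today):
    """Return (id, description) pairs for rows that deserve a closer look."""
    issues = []
    usual_id_length = Counter(len(r["id"]) for r in rows).most_common(1)[0][0]
    for row in rows:
        if len(row["id"]) != usual_id_length:
            issues.append((row["id"], "unusual id length"))
        dob = row["dob"]
        if dob in MISSING:
            issues.append((row["id"], f"dob missing, stored as {dob!r}"))
        elif parse_iso_date(dob) is None:
            issues.append((row["id"], "dob is not a YYYY-MM-DD date"))
        elif parse_iso_date(dob) >= today:
            issues.append((row["id"], "dob is not in the past"))
        seconds = row["active_seconds"]
        if seconds not in MISSING and int(seconds) > SUSPICIOUS_SECONDS:
            issues.append((row["id"], "active_seconds looks high"))
    return issues


for row_id, problem in find_issues(ROWS, today=date.today()):
    print(row_id, "-", problem)
```

The checks do not replace looking at the data. They only shorten the list of rows you go and read.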

Looking at the data forces you to think about what the data is representing. It changes it from an abstract blob of numbers to something real and tangible.

These questions often help to serve as an important early identifier for data quality issues, but they also build understanding. They add meaning to the summary statistics. Does an average height of 181 cm make sense? What about if it is a sample of people from Guatemala? What about if it is a sample of Guatemalan basketball players?

The second reason to look at the data is what I will call pattern recognition. Human beings are pattern recognition machines: we love looking for and finding patterns in everything around us. Sometimes we love it a little too much and we start seeing patterns in random noise (e.g. horoscopes and technical analysis).

With modern datasets (i.e. “BIG DATA”), there is often too much raw data for us to look at random rows and derive any real meaning; we get overwhelmed. But our pattern recognition can really shine when we start looking at subsets of the data – outliers, duplicates, really bad and/or suspiciously good predictions from a model.

For example, you build a model that performs well on your chosen metric. That’s great! But now let’s look at the predictions that are way off. What do those data points look like? Do they have something in common? In these situations, give yourself a license to dig, to go down the rabbit hole, see what you can find. Maybe you will find a thread to pull and discover your model isn’t so great after all. Or maybe you will find those bad predictions are data points that represent some edge-case or data quality issue that can be adjusted for, making the model even better and more robust.
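As a rough sketch of that workflow, the snippet below sorts predictions by absolute error and checks whether the worst ones share an attribute. The rows, the `region` column and the error measure are assumptions made up for the example; in practice you would use your own model output and whatever attributes you have.

```python
from collections import Counter

# Hypothetical model output: actual value, prediction and one attribute per row.
PREDICTIONS = [
    {"id": 1, "region": "north", "actual": 100.0, "predicted": 104.0},
    {"id": 2, "region": "south", "actual": 80.0, "predicted": 79.0},
    {"id": 3, "region": "west", "actual": 120.0, "predicted": 45.0},
    {"id": 4, "region": "north", "actual": 95.0, "predicted": 97.0},
    {"id": 5, "region": "west", "actual": 60.0, "predicted": 130.0},
    {"id": 6, "region": "south", "actual": 110.0, "predicted": 108.0},
]


def worst_predictions(rows, n):
    """Return the n rows with the largest absolute error, worst first."""
    return sorted(
        rows,
        key=lambda row: abs(row["actual"] - row["predicted"]),
        reverse=True,
    )[:n]


worst = worst_predictions(PREDICTIONS, n=2)
for row in worst:
    print(row)  # read the actual rows, not just the error metric

# Do the worst rows have something in common?
shared_regions = Counter(row["region"] for row in worst)
print(shared_regions.most_common())
```

In this made-up data both of the worst predictions come from the same region, which is exactly the kind of thread worth pulling.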

The best Data Scientists recognize that getting lost down those rabbit holes, at least for a little while, is what builds true understanding of the data you are working with. But the only way you are going to do that is if you look at the data.

TLDR:

People who don't know any statistics just look at the data.

People who know some statistics run hypothesis tests, compute confidence intervals, etc.

People who know lots of statistics just look at the data.
— Josh Wills (@josh_wills) February 12, 2021

2. They choose the right tool for the job

One of the most common arguments in data science (and probably most professions) is over what tools to use. Python or R. PostgreSQL or MySQL. Google Cloud Platform, Microsoft Azure or AWS. Tableau or Power BI (or one of the 137 other BI tools on the market). Lots of tools, even more opinions, what to do? From what I have observed, the best Data Scientists follow these broad steps.

Step 1

Do your research and work out what tools are best suited for your given problem. Need to do some data wrangling? Python or R are the best choices¹. Need to store a tonne of analytics data for analysis, business intelligence and building models? A column-oriented database like BigQuery, Snowflake or Redshift is a good option, but if you know SQL (see observation #4), you will be good to go in all of them. Want to make nice dashboards? Focus on being able to easily import your data, and trust me when I say that what data you choose to display and the choice of chart are far more important than knowing how to use a specific BI tool.

Step 2

Dismiss with prejudice people who say there are black and white answers to these questions. People who say “X is always better than Y” or “Z is garbage, I don’t know why anyone would use it” just aren’t worth listening to. Listen instead to the one who says “it depends”. The real world is a sea of grey, and every tool has its advantages and disadvantages… except for Google Sheets, that is just garbage.

Step 3

Don’t overthink it and choose one. Be confident that you will be able to get the job done with any of your shortlisted tools. Know that you can change later if needed and that changing to another tool in the same segment is mostly about learning different syntax and/or finding where an equivalent option is. This especially applies to data manipulation, whether it be in Python/pandas, R or another language. I started working in data with Excel and SQL, and taught myself R and pandas primarily through Google searches like “inner join in R” and “pivot table in pandas”.

Step 4

Realize that the real trap is not choosing the “wrong” tool to start with, it is trying to use that tool(s) to solve all problems. This tends to happen over time because the tools we use start to change the way we think about problems. It is the “everything looks like a nail when you have a hammer” cliché. If you only ever work with relational databases, you’ll start to see every dataset as tabular data. If you only ever create dashboards with Tableau, you will start to overlook chart types and options that aren’t available in Tableau.

“We shape our tools and thereafter our tools shape us.”
John M. Culkin²

Summing up, how do the best Data Scientists choose the right tool for the right job? They do their research, gather nuanced opinions, and then don’t overthink it. They know they are not committed for life and that, most importantly, the key is to stay open to learning new tools, not trying to use one tool for all problems.

3. They think outside the black box

Having started working with data before ‘Data Scientist’ was a job title and before the use of machine learning algorithms was commonplace (outside of academia at least), my first “real” job was all about creating models… in Excel. Some of these models were simple throw-away models that were never really validated or evaluated. But some were complex models used to forecast national tax revenues on a monthly basis looking forward for 5 years. Building and maintaining those models, there were two important lessons I learned:

Non-machine learning models/algorithms can be as simple or complex as you want.
The accuracy (however you are measuring it) of a model is almost always less important than the model being logical, explainable and defensible.

Fast forward 10 years to a world of easily accessible machine learning algorithms, and all too often a non-machine learning algorithm/model is not even a consideration. Why would you not use a machine learning or deep learning algorithm, something that is capable of accurately classifying images and generating entire original articles given a one line prompt, and instead build a glorified pile of if-else statements?

For the vast majority of modelling challenges in the business world, the best Data Scientists know to think outside the machine learning black box.

The thing is, machine learning, thanks to libraries like TensorFlow, Keras and scikit-learn, is easy. By that I don’t mean the data processing steps of collecting and cleaning data, feature engineering, or the validation process after you make your predictions. Those steps are time consuming and complex, but are also required regardless of how you construct a model. The easy part is the process of actually building the model. You select an algorithm, make some pretty arbitrary choices about the number of hidden layers or the maximum tree depth, then give it some training data. Maybe you do some hyperparameter tuning, if you didn’t automate that part as well.

Developing a custom model, on the other hand, is almost always harder. You have to actually think about what factors might be impacting your dependent variable and what is driving that relationship. You have to consider causality. You need to make assumptions that you will have to explain and defend. You need to build a logical model before you build the actual model. Maybe most importantly, you need to understand the relevant domain.

A custom model represents a concise summary of what you believe is happening. When you build and test this model, you are quantifying your belief of how this little piece of the world works. This can feel risky because you are exposing and testing your theory and you could be completely wrong. What the best Data Scientists realize is that a) using their expertise and experience to make these sorts of decisions is why they get paid the big bucks, and b) regardless of whether they are initially right or wrong, they will be uncovering important insights that will improve the model and their understanding.
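To make that concrete, here is a deliberately tiny sketch of what such a model can look like in code. It is not the Excel model described earlier; the structure (a base level, a growth rate and seasonal multipliers) and every name and number are hypothetical. The point is that each assumption is a named value someone can question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RevenueAssumptions:
    """Each field is a belief that can be explained, defended or proven wrong."""
    base_monthly_revenue: float  # revenue in the starting month
    annual_growth: float         # 0.03 means 3% growth per year
    seasonal_factors: tuple      # twelve multipliers, January first


def forecast(assumptions, months_ahead, start_month=1):
    """Return forecast revenue for each of the next months_ahead months."""
    monthly_growth = (1 + assumptions.annual_growth) ** (1 / 12)
    result = []
    for step in range(months_ahead):
        trend = assumptions.base_monthly_revenue * monthly_growth ** step
        month_index = (start_month - 1 + step) % 12
        result.append(trend * assumptions.seasonal_factors[month_index])
    return result


# Invented numbers: a December peak and a quiet January.
example = RevenueAssumptions(
    base_monthly_revenue=1000.0,
    annual_growth=0.03,
    seasonal_factors=(0.9, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.1),
)
for month, value in enumerate(forecast(example, months_ahead=3), start=1):
    print(month, round(value, 2))
```

If the forecast turns out badly wrong, there are only three places to look, and finding out which assumption failed is itself a useful result.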

Machine learning in contrast feels safe. We don’t have to make assumptions, or put forward theories or opinions. If it goes wrong, we can blame the algorithm and/or the data. It is out of our control. When we build a machine learning model, we don’t get an interpretable logical model of the relationships even after we build the model so there is nothing to prove wrong. Sure, there are metrics on feature importance and we can calculate correlation coefficients (see #1), but the primary purpose of machine learning algorithms is to model relationships we believe are too complex for us to understand. The best Data Scientists see that as a drawback, not a feature.

The best Data Scientists also know the other big advantage of a custom model is when it comes to maintenance and updating of the model. Any model, no matter how good, has a limited lifespan. But a model that works on broad principles rather than being fitted to a specific training dataset is likely to be significantly more resilient to drift, which means less maintenance. And if the relationship represented in the model does stop reflecting reality, that is probably something you want to know!

With a machine learning model, on the other hand? Outside of regular maintenance, if the model stops working what are the options? You can feed in some new training data and/or re-weight old training data, try to work out some new features to add, tune some hyperparameters… At the end of the day though, you are essentially left hoping the model fixes itself when you retrain it.

Of course, there are many problems and domains where machine learning is going to be far superior to any algorithm a human can come up with. Computer vision and natural language processing (NLP) are two fields that come to mind where trying to develop a logical model is both practically impossible and of questionable value. Additionally, sometimes the dataset is just too big and complex to have any hope of interpreting it, or perhaps we just don’t care about the interpretation, the result is the only thing that matters.

What this means is that the best Data Scientists know they need to consider what matters in each individual case. Is an extra point of accuracy worth the extra maintenance and the loss of interpretability? For the best Data Scientists, the answer to that question is “no” far more often than it is “yes”. They know the machine learning toolbox should be closer to a last resort than a first option. And they know better than to casually toss out that custom model that someone spent a bunch of time building because it doesn’t use “machine learning”.

4. They are comfortable with SQL

SQL. Structured Query Language. Love it or hate it, what isn’t up for debate is that SQL is everywhere. Countless libraries, databases and dashboards abstract away SQL through nice APIs or GUIs, but underneath it all, there is still SQL and for good reason. The best Data Scientists know that not being comfortable in SQL is a big handicap. Here are the two main reasons why.

The first reason is that not knowing SQL severely hinders your versatility and independence as a Data Scientist. Sure, there are lots of tools and packages that will help you extract data and generate SQL queries for you, but being a Data Scientist and relying on those tools is like being a programmer and relying on “no-code” solutions: the code is still being written, you just don’t know what it looks like. And if you don’t know what the SQL query looks like or how to read it, it’s going to be pretty hard to fix it when it doesn’t work. It also means if you do need to query the database directly because something broke, you won’t be able to.

The second reason is that the database you are using is almost certainly more efficient at processing large datasets (see #2) than whatever language you are using to wrangle the data after you get it out of the database (e.g. R or Python). For small datasets, this doesn’t matter so much, but once you start working with datasets over a few gigabytes, the performance difference becomes a big deal. Having the ability to write SQL lets you shift some, and sometimes most, of your data wrangling to that more efficient and powerful database.
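The sketch below shows the shape of that shift using Python's built-in `sqlite3` module and a made-up `events` table. An in-memory SQLite database with five rows will not show any performance difference, so treat it as an illustration of where the work happens, not a benchmark.

```python
import sqlite3

# Hypothetical table and values, created in memory for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, seconds_active INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 120), (1, 300), (2, 60), (3, 45), (3, 15)],
)

# Option 1: pull every row out of the database and aggregate in Python.
totals_in_python = {}
for user_id, seconds in conn.execute("SELECT user_id, seconds_active FROM events"):
    totals_in_python[user_id] = totals_in_python.get(user_id, 0) + seconds

# Option 2: let the database aggregate and return one row per user.
totals_in_sql = dict(
    conn.execute(
        "SELECT user_id, SUM(seconds_active) FROM events GROUP BY user_id"
    )
)

print(totals_in_python)
print(totals_in_sql)
```

Both options give the same totals. The difference is that the second one only moves one row per user out of the database instead of every event.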

Now the good news: Learning SQL is easy, especially if you are already working with data. Almost every operation you are doing in R or pandas or even Excel has an equivalent in SQL. In fact those operations were probably created based on an equivalent SQL function. So spend a little time practicing, write the query yourself instead of relying on a library to do it for you, and then when the day comes that you get a multiple gigabyte dataset or an auto-generated query fails, the situation won’t seem so daunting.
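As one example of those equivalents, here is an inner join followed by a simple pivot, written as a single SQL query and run through Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are invented, and the `SUM(CASE WHEN ...)` pattern is just one common way to pivot in SQL; some databases offer dedicated pivot syntax instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, channel TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'south');
    INSERT INTO orders VALUES
        (1, 'web', 10.0), (1, 'store', 5.0), (2, 'web', 7.5), (4, 'web', 99.0);
    """
)

# Inner join, then pivot: one row per region, one column per channel.
query = """
    SELECT c.region,
           SUM(CASE WHEN o.channel = 'web' THEN o.amount ELSE 0.0 END) AS web,
           SUM(CASE WHEN o.channel = 'store' THEN o.amount ELSE 0.0 END) AS store
    FROM orders AS o
    INNER JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
pivot = conn.execute(query).fetchall()
for region, web, store in pivot:
    print(region, web, store)
```

Note what the inner join quietly drops: the order from customer 4, who has no matching customer record, and customer 3, who has no orders. That behaviour is the same whether you write the join in SQL, R or pandas.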

5. They know they work in software development…

People coming into data science often come from two broad paths. One increasingly common path is from a math/statistics background. The best Data Scientists recognize that, outside a few specific research focused positions, they are in a field where they need to write good code and use tools that are used in software development.

To be clear, that means you need to be able to do more than hack together long R/Python scripts. It means writing clean code that runs efficiently and reliably. Code that is consistent, structured, organized and compartmentalized. Functions, classes, virtual environments, containerization, unit testing. You will also need to learn to use git version control and how to interact with a bash terminal.
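For a sense of scale, here is about the smallest version of "a function plus a unit test" using Python's built-in `unittest` module. The function name and the list of missing-value markers are hypothetical; the pattern is what matters, one small documented function with tests that pin down its behaviour.

```python
import unittest

MISSING_MARKERS = {"", "null", "none", "n/a"}  # assumed list; adjust per dataset


def normalize_missing(value):
    """Map the different spellings of 'no value' to None; strip other strings."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return text


class TestNormalizeMissing(unittest.TestCase):
    def test_missing_markers_become_none(self):
        for raw in (None, "", "null", " NULL ", "N/A"):
            self.assertIsNone(normalize_missing(raw))

    def test_real_values_are_kept(self):
        self.assertEqual(normalize_missing(" 42 "), "42")
        self.assertEqual(normalize_missing("nullable"), "nullable")


# Run the tests explicitly so the snippet also works when pasted into a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalizeMissing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the function and the tests would live in separate files and run from the terminal, but the idea is the same.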

You don’t have to know how to use those things on day one, no one does, but you do need to be open to learning those skills and not expect someone else to do it for you forever. These are the tools of the trade, and I assure you that you will understand why when you get comfortable with them.

The best Data Scientists, rather than seeing this as a burden or a distraction, see it as an opportunity. If you embrace it, you will quickly improve with practice and experience. For many, the improvement is so quick and drastic that their own code from six months ago is deeply embarrassing. It also opens up so many possibilities. Scripts to scrape data from websites or interact with APIs, web applications, programs to organize your desktop or save emails. These are skills you get to keep for life.

6. …but also know the importance of understanding statistics and probability

For those coming from the other path, a software development background, the best Data Scientists know all too well that implementing a machine learning algorithm and optimizing it is not “data science”.

In fact, I strongly suspect this reductionist view of data science is what leads to some of the more silly and dangerous claims about AI, claims like you can predict if someone is high IQ or a terrorist from their face, that you can find the best employees by analyzing how they behave in an interview, or that you can accurately de-pixelate photos. In fact there are so many examples of bad AI implementations there is a popular GitHub repo devoted to keeping track of them.

What has this got to do with statistics and probability? Beyond specific methods, what you tend to learn as part of studying these subjects is how easy it is to be fooled by randomness, and how to look for tell-tale signs that you might be getting fooled. In fact, of the Data Scientists I have met, there has been a strong correlation between years spent working with data and skepticism towards claims made based on data. Measures of uncertainty, knowing the relationship between samples and populations (and which one your data represents), how randomness and noise impact models and data. All of these are hard to appreciate coming from a straight software development background.
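A quick simulation shows how easy it is to be fooled. In the sketch below the target and all 200 features are independent random noise, so no feature has any real relationship with the target, yet the best-looking one will usually show a correlation that would be tempting to report. The sizes and the seed are arbitrary choices for the illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2021)      # fixed seed so the run is repeatable
n_rows, n_features = 30, 200   # arbitrary sizes chosen for the illustration

# The target and every feature are independent noise: nothing predicts anything.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

correlations = [pearson(feature, target) for feature in features]
strongest = max(correlations, key=abs)
print(f"strongest correlation among {n_features} noise features: {strongest:+.2f}")
```

Try enough features on a small sample and something will always look promising. Knowing that is most of the defence.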

So where do you start? For learning probability and statistics basics, there is a huge range of free courses online. Get comfortable with the terminology and the concepts. “What is a p value?” and “What does a 95% confidence interval represent?” are not trick questions, and should be something you can answer with confidence. Correlation coefficients, standard errors, random variables and the characteristics of common distributions (e.g. normal, log-normal, exponential, chi-squared, binomial, Poisson) should all be familiar to you, at least to the extent that you know when one might apply and where to find more information. As with the previous section, you don’t need to know these things on day 1, but these topics need to be on your to-do list.
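One way to build intuition for the confidence interval question is to simulate it. The sketch below repeatedly draws a sample from a population whose mean we know, builds an approximate 95% interval from each sample, and counts how often the interval contains the true mean. The population values (mean 181, standard deviation 7), the sample size and the number of trials are arbitrary, and the interval uses the normal approximation (1.96 standard errors) rather than a t-distribution.

```python
import random
import statistics


def interval_covers_mean(rng, true_mean, true_sd, n):
    """Draw one sample and report whether its ~95% interval covers true_mean."""
    sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    margin = 1.96 * standard_error  # normal approximation
    return mean - margin <= true_mean <= mean + margin


rng = random.Random(7)  # fixed seed so the run is repeatable
trials = 2000
hits = sum(
    interval_covers_mean(rng, true_mean=181.0, true_sd=7.0, n=100)
    for _ in range(trials)
)
coverage = hits / trials
print(f"{coverage:.1%} of intervals contained the true mean")
```

The figure should land close to 95%. The "95%" describes how often the procedure captures the true value across repeated samples, not the probability that any one particular interval is right.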

7. They are highly empathetic

I left it for last, but to me it is by far the most important trait of the best Data Scientists I have had the pleasure of working with. They are highly empathetic. But what does that mean? Let’s break it down.

An often underrated soft-skill, empathy is the ability to understand how someone else feels. “Putting yourself in someone else’s shoes” and “reading the room” are both shorthand references to being empathetic. Why is it so important?

The first reason is effective communication. Ever been stuck in a boring conversation where the other person just keeps yapping away even after you put your headphones on and turn back to your screen? Don’t be that guy (and it is almost always a guy).

It starts with the basics: Who are you talking to? What is the setting? Are they in the mood for a joke or is this a serious conversation? What information do they need to know? And perhaps most importantly for data science, what level are they at technically? How can I present this technical topic in such a way that I don’t make my audience feel dumb but also not feel condescended to? Presenting a model to some business people? Maybe don’t go into the details of gradient descent and back propagation. Presenting to your fellow Data Scientists? You can probably skip the slide on how to calculate a weighted average. The best Data Scientists I have met are always extremely effective communicators.

The second reason empathy is important is for getting along with, assisting and perhaps even leading colleagues. Data science is a huge field with an almost endless array of techniques, algorithms, tools and languages, and specialized jargon. No one knows it all, and imposter syndrome is rife in the industry. And yet, as a result of a heavy bias towards what I’ll call “hyper-rational STEMlord-ism”, all too often we are too quick to judge and criticize others for perceived shortcomings. Want to hear about a shortcoming? I went into an interview for a Data Scientist position and couldn’t answer the question “What is data leakage?” It wasn’t that I didn’t understand the concept, I had just never heard it referred to as “data leakage”. I would have just called it “bad modelling”. It was a very short interview.

The best Data Scientists I have known never judged, criticized or mocked. Despite knowing significantly more than others, they worked to bring everyone up to their level. They took the time to show the techniques, explain why they work, and teach the jargon. They empathized because they knew they had been in the position of not knowing something, and would probably be in that position again sometime soon.

Now, think about how much more pleasant this field would be if just a few more of us had that level of empathy.

Wrapping up

So that’s it, that’s my 7 traits of the best Data Scientists. I hope you got something out of this, even if it was just a laugh. Obviously these are my opinions and many will disagree, although not too many I hope. I don’t pretend to have all the answers, in fact, if my recent job interview experiences are anything to go by, I have very few answers. But what I do know is that 10 years later I am still learning and getting better, and enjoying the experience. With that, I will leave you with one final thought:

The day you believe you know everything is the day you stop learning.

[1] Not a comment on the quality of these languages vs other languages, but they are far more commonly used, and have more resources for learning and troubleshooting.

[2] Apparently Winston Churchill coined a similar phrase but Culkin popularized this particular form.
