For the past few months, I have posted on several data science topics, but none of those posts was geared toward the average Joe. In the spirit of learning from the basics, it is only fitting that I dedicate this post to explaining, in lay terms, what data science is for those who want to understand the definition without delving into technicalities.
I am certain that a lot has been written on the definition of data science all over the web, but I hope that my writing will enrich your understanding of the topic alongside those other materials. Please note that this post is written as simply as possible without compromising a basic understanding of data science. As such, certain generalizations are made to aid comprehension. Please feel free to share your opinions in the comment section; I welcome all discussion on the matter.
A Rose By Any Other Name…
Data science, business intelligence, analytics, big data, statistics – is that all? Maybe I missed a few related terms. You have probably heard these terms several times, but what do they really mean? Just run a few searches on the difference between these terms – like this one; you will find a variety of articles and discussions explaining where their boundaries are and how they overlap, each probably differing from the others in certain ways.
The truth is, all of them are correct. This is partly because no definitive authority defines these terms, so there are no fixed boundaries between them. Furthermore, some of these terms may be natural evolutions of one another, or buzzwords created to build hype and sell products. The fact that some people champion certain terms while dismissing others does not help matters either.
I welcome all these differences. For me, they are like mirrors of different forms and shapes that nonetheless reflect the same image. They may differ in name, in appearance, and in the way people react to them, but they all preach the same message.
I use the term data science now, and will for the foreseeable future, primarily because it is arguably the most talked-about term. At the moment, it is easily recognized by people outside the field. As a result, it is easier to use when communicating with non-data scientists, and they will be more receptive to a data scientist’s view.
I am of the opinion that if there is ONE word to describe the goal of data science, it is communication: the ability to convey knowledge and information to anyone, to make them understand the message behind an analysis in the most efficient manner, and to allow the relevant people to take action based on the way the information is presented. All of that requires communication, and most of it needs to be expressed in lay terms.
All that being said, I probably will not help the debate by adding my own take on what data science is in this post. Nevertheless, my goal is not to argue over differences and semantics; it is to make data science easily understandable in lay terms.
Data Science Definition
Let us now look at several definitions of data science from around the web. Since data scientist is a term often used in conjunction with data science, I will include descriptions of that role as well. Some quotes are not directly stated as definitions, but it can be inferred that they describe the workings of data science. The quotes, titles, and company names are accurate as of the date of this post.
- Jeff Hammerbacher, Chief Scientist, Cloudera (formerly Data Manager at Facebook): “… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analyses to other members of the organization.” – on his data science group at Facebook. It is also worth noting that Jeff and DJ Patil are often credited with coining the term data scientist.
- Mic Farris, Data Science and Analytics Leader, Areté Associates: “Data science is the general analysis of the creation of data. This means the comprehensive understanding of where data comes from, what data represents, and how to turn data into actionable information (something upon which we can base decisions).” – from his blog (23 September 2011).
- Mike Loukides, Vice President of Content Strategy, O’Reilly Radar: “What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” – from his article (2 June 2010).
- Hal Varian, Chief Economist, Google: “The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.“
- Frank Lo, Founder and Lead Developer, Datajobs.com: “Data science is deep knowledge discovery through data inference and exploration. This discipline often involves using mathematic and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to figure out hidden insight that lies beneath the surface. It centers around evidence-based analytical rigor and building robust decision capabilities.” – from blog.
- EMC article, Data Science Revealed: “Data science applies advanced analytical tools and algorithms to generate predictive insights and new product innovations that are a direct result of the data.”
There are other definitions all over the web. Although I try to be thorough, I cannot include every definition of data science in this post. The definitions above were chosen based on their prominence in Google search results, and they also represent themes that recur in the other definitions I have come across.
Common Patterns and Themes
Based on these definitions, we can look for patterns they have in common and use them to come up with a unified definition of data science. The table below lists excerpts from the quotes alongside the theme or pattern that each excerpt reflects.
| Quote | Theme/Pattern |
| --- | --- |
| “…author a multistage processing pipeline in Python…” | Organizing data |
| “…design a hypothesis test…” | Analysis of data |
| “…perform a regression analysis over data samples with R…” | Analysis of data |
| “…design and implement an algorithm for some data-intensive product or service in Hadoop…” | Analysis of data |
| “…communicate the results of our analyses to other members of the organization…” | Presentation of data |
| “…Data science is the general analysis of the creation of data…” | Analysis of data |
| “…comprehensive understanding of where data comes from…” | Organizing data |
| “…what data represents…” | Analysis of data |
| “…how to turn data into actionable information (something upon which we can base decisions)…” | Analysis of data |
| “…data science is a holistic approach…” | General definition |
| “…data scientists are involved with gathering data…” | Organizing data |
| “…massaging it into a tractable form…” | Organizing data |
| “…making it tell its story…” | Analysis of data |
| “…presenting that story to others…” | Presentation of data |
| “The ability to take data…” | Organizing data |
| “…to be able to understand it…” | Analysis of data |
| “…to process it…” | Analysis of data |
| “…to extract value from it…” | Analysis of data |
| “…to visualize it…” | Presentation of data |
| “…to communicate it …” | Presentation of data |
| “Data science is deep knowledge discovery through data inference and exploration…” | General definition |
| “This discipline often involves using mathematic and algorithmic techniques to solve some of the most analytically complex business problems…” | Analysis of data |
| “…leveraging troves of raw information to figure out hidden insight that lies beneath the surface.” | Analysis of data |
| “It centers around evidence-based analytical rigor and building robust decision capabilities.” | Analysis of data |
| “Data science applies advanced analytical tools and algorithms…” | Analysis of data |
| “… to generate predictive insights and new product innovations that are a direct result of the data.” | Analysis of data |
There are many ways to categorize the themes and patterns, but I have always found it useful to keep things simple when communicating information. As such, the table uses the smallest set of themes that I was able to come up with without devaluing any of them.
We can see that the three most common themes in these definitions of data science are organization of data, analysis of data, and presentation of data. Brief definitions of the themes follow:
- Organization of data: Gathering data from relevant sources and organizing it into structured models that allow efficient data analysis.
- Analysis of data: Applying statistical and mathematical analysis methods to data to derive patterns that can be used to add value.
- Presentation of data: Communicating the results of data analysis using perceptual methods, delivered effectively enough to generate action.
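For readers who would like to see the counting behind "most common" spelled out, here is a minimal Python sketch that tallies the theme labels from the table above. The labels and their order are copied from the table; the variable names and the grouping by organization are my own choices for illustration, not part of any of the quoted definitions.

```python
from collections import Counter

# Short aliases for the four labels used in the table.
ORG, ANA, PRE, GEN = (
    "Organizing data",
    "Analysis of data",
    "Presentation of data",
    "General definition",
)

# Theme labels in table order, grouped by the organization of each quoted source.
themes_by_source = {
    "Cloudera": [ORG, ANA, ANA, ANA, PRE],
    "Areté Associates": [ANA, ORG, ANA, ANA],
    "O'Reilly": [GEN, ORG, ORG, ANA, PRE],
    "Google": [ORG, ANA, ANA, ANA, PRE, PRE],
    "Datajobs.com": [GEN, ANA, ANA, ANA],
    "EMC": [ANA, ANA],
}

# Count how often each theme appears across all excerpts.
theme_counts = Counter(
    label for labels in themes_by_source.values() for label in labels
)

# Print the themes from most to least frequent.
for label, count in theme_counts.most_common():
    print(f"{label}: {count}")
```

Across the 26 excerpts, analysis of data appears 15 times, organizing data 5 times, and presentation of data 4 times, with the remaining 2 excerpts falling under general definition. The tally depends on how I chose to label each excerpt, so a different reader could reasonably arrive at slightly different counts.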
Two excerpts are grouped under general definition. One describes data science as a “holistic approach” and the other as “knowledge discovery”. The phrase “holistic approach” is perhaps a response to the opinion that there are boundaries between the approaches of data science-related fields such as statistics, business intelligence, and others; it indicates that data science encompasses all of them. Knowledge discovery is one of the purposes of data science; however, I also believe that any data science work needs to be discoverable enough to inspire action from its audience or readers.
Does It Use the Scientific Method?
One interesting pattern that I observed in these definitions and many others is that although the word “science” is part of “data science”, a strict scientific method, such as the scientific study of data, is not necessarily a required part of data science techniques. What we usually see in data science work is data analysis aimed at identifying patterns that can be applied in the real world.
Nevertheless, I have often found the scientific method to be a valuable tool in data science work. Questions and hypotheses in the scientific world have close analogues in the business world. Businesses want to know how they can leverage their core competencies. A hypothesis may be formed based on past business performance (e.g. certain products in certain markets tend to do better). Data, accurate and precise data, needs to be gathered to formulate a theory that predicts results. The theory needs to be tested in a pilot setting and then in the real world. The actual results need to be analysed to confirm or reject the theory and to improve it. The results can then be used in future data science work to answer other questions.
The scientific method thus helps guard against biases and judgmental conclusions, because facts (from data), tested for accuracy and relevance, are a core part of the method itself. As such, even though many definitions do not require it, I will include the scientific method in my definition.
My Definition of Data Science
Based on the diverse definitions of data science and the general patterns among them, I offer my definition of data science in the simplest words as follows:
Data science is the application of scientific methods to organized data, using mathematical analysis to generate a valuable result that is communicated through the most effective medium, with the goal of generating actions that create value.
I hope this has helped you understand the definition of data science in terms as lay as I can muster. In a future article, I will elaborate on the definition by expanding upon the meaning of its key phrases – organized data, mathematical analysis, valuable result, effective medium, and value. Although opinions on the meaning of these phrases can be diverse, I believe that the definition encompasses the scope of data science in the simplest terms possible.