There are many areas, processes, and disciplines relating to data science. A cursory search on Google alone reveals different frameworks, theories, and opinions on a wide variety of topics relating to data science. For the average person, all these diverse topics can be confusing when all they need is a cursory understanding of what data science is about. Despite that, in my opinion, the majority of these materials, diverse as they are, share a common pattern and theme when stripped down to the basics. As such, a generalization can be made from these patterns and themes that enables an understanding of data science in lay terms.
The Rule of Three
For that purpose, I shall utilize the “rule of three” to articulate this generalization. Why the rule of three? In lay terms, we humans tend to remember things more easily in groups of three. Perhaps you have heard of the magical number seven, plus or minus two (7 ± 2), which has been touted as the limit of our brain’s working memory capacity. This research paper by George A. Miller explains the magic number in greater detail. Other research by Steven J. Luck and Edward K. Vogel indicates the number is actually closer to four. In follow-up research, Edward K. Vogel (the same one) and Maro G. Machizawa provide neurological evidence for differences in working memory capacity between individual brains, which may explain why some people can hold more information at a time than others.
Regardless, the rule of three is the most famous variation of this theme. This article by Brian Clark explains the engaging nature of the rule as it appears in various aspects of our lives, including writing, comedy, fairy tales, famous quotes, storytelling, and even the military. It works well in presentations, where audiences are likely to remember only three things from any talk, so it is wise to group points in threes. Even the late Steve Jobs used it. Closer to our topic, even statistics and mathematics have a rule of three.
In short, the rule of three is a very effective method for remembering things, and as such, a very useful tool to aid in understanding the basics of data science. I shall present below the three foundations of data science. Any and all topics relating to data science can be linked back to these three foundations. It is my hope that if and when you delve into all things data science, you will be able to relate the different data science knowledge back to them. I use this as a tool – albeit in a more elaborate form – in my training and consulting services to bring my audience’s understanding back to basics, so that at any time they know what they understand, where they stand, and what to study next in the whole data science domain.
The foundations serve to simplify understanding of data science. In the words of Henry David Thoreau, “simplicity, simplicity, simplicity”.
Three Principal Areas of Data Science
With the rule of three in mind, I divide the data science areas into three distinct but intrinsically interrelated areas:
- Data Organization
- Data Analysis
- Data Visualization
I shall describe briefly what I mean by each of these areas below. We shall explore their nuances in future posts, covering aspects such as technical skill requirements, the technologies that support each area, each area’s essential aim, and much more. For now, to keep things simple, brief descriptions will do.
Some refer to data organization as data integration or data structuring. In any case, the aim is more or less always the same: integrate data from multiple sources, structure and format it using a common standardized data model, and make it accessible in a central location, virtual or otherwise. Some would call this type of work ETL – extract, transform, and load.
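As a minimal sketch of the extract-transform-load pattern described above – the source systems, field names, and date formats here are entirely hypothetical, invented for illustration – the flow might look like this in Python:

```python
from datetime import date

# Extract: two hypothetical sources with different field names and formats.
crm_records = [{"cust": "Alice", "signup": "2014-03-01"}]
erp_records = [{"customer_name": "Bob", "joined": "01/06/2014"}]

def transform_crm(rec):
    # Map CRM fields onto a common, standardized data model.
    y, m, d = (int(x) for x in rec["signup"].split("-"))
    return {"name": rec["cust"], "signup_date": date(y, m, d)}

def transform_erp(rec):
    # ERP dates arrive as day/month/year; normalize them to the same model.
    d, m, y = (int(x) for x in rec["joined"].split("/"))
    return {"name": rec["customer_name"], "signup_date": date(y, m, d)}

def load(records, warehouse):
    # Load: append standardized records to one central store.
    warehouse.extend(records)

warehouse = []
load([transform_crm(r) for r in crm_records], warehouse)
load([transform_erp(r) for r in erp_records], warehouse)
# warehouse now holds both sources in a single common format.
```

Real ETL pipelines add validation, error handling, and scheduling on top of this skeleton, but the three steps remain the same.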
Integrated and organized data is the first step in trying to analyse data. If the data requirement is new, data models can be developed to ensure that data quality is on par with what the data will be used for at the analysis stage. Existing data may need to undergo some form of transformation if quality is an issue.
This area requires a great deal of technical expertise in fields such as data warehousing and data modelling, among others. Experts need to be very well versed in data governance, data storage and archival, and data security, among others. In my opinion, this is one of the most common business offerings that business intelligence (BI) consulting firms sell to their customers. Indeed, back in the 1980s, executive information systems (EIS) and decision support systems (DSS) were created with the goal of data organization in mind.
Unfortunately, in this day and age when computing power is at an all-time high, some BI firms focus on selling fluffy-looking dashboards while disregarding the core challenges of data organization: good data models and data quality. These often overlooked concerns lie in the back end of BI development and deserve more attention. After all, decisions made using faulty data can be harmful, no matter how beautiful the charts that present them look.
Once organized data is available, it can be analysed for patterns and trends that describe the data, and even predict outcomes based on them. Companies can use their data to examine how their products perform by slicing the data along dimensions such as age group and geographic area, among others. Questions such as “what are the factors influencing my products’ sales?” can be answered by exploring the data. The results can be supplemented by traditional market research data, enhancing the quality of analysts’ work.
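To make the idea of slicing data along dimensions concrete, here is a small sketch using only the Python standard library – the sales records, dimension names, and figures are invented for illustration:

```python
from collections import defaultdict

# Hypothetical sales records with two dimensions: age group and region.
sales = [
    {"age_group": "18-25", "region": "North", "amount": 120},
    {"age_group": "18-25", "region": "South", "amount": 80},
    {"age_group": "26-40", "region": "North", "amount": 200},
    {"age_group": "26-40", "region": "South", "amount": 150},
]

def slice_by(records, dimension):
    # Aggregate the sales amount across one dimension of the data.
    totals = defaultdict(int)
    for rec in records:
        totals[rec[dimension]] += rec["amount"]
    return dict(totals)

by_age = slice_by(sales, "age_group")   # {'18-25': 200, '26-40': 350}
by_region = slice_by(sales, "region")   # {'North': 320, 'South': 230}
```

In practice, a library such as pandas handles this kind of group-and-aggregate work at scale, but the underlying operation is the same.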
Based on the patterns, mathematical models can be developed, using statistical methods, to predict future sales. Interest in predictive analytics is currently very high, with some calling it the next stage of business intelligence. Computers are powerful enough to combine data from multiple sources and run extensive simulations in a fraction of the time the same activity would have taken ten years ago. A fairly recent real-life example is Nate Silver’s forecast of the 2012 United States presidential election, where he used statistical methods to estimate the probability of Obama or Romney winning, and correctly called the outcome in all fifty states.
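A predictive model can be as simple as fitting a trend line to historical figures and projecting it forward. The sketch below fits a straight line to invented yearly sales numbers using ordinary least squares; real predictive models are of course far more elaborate than this:

```python
# Ordinary least-squares fit of a straight line to yearly sales,
# then a one-step-ahead projection. The figures are invented.
years = [2010, 2011, 2012, 2013]
sales = [100.0, 110.0, 125.0, 130.0]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n

# slope = covariance(years, sales) / variance(years)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

def predict(year):
    # Project the fitted trend to a future year.
    return intercept + slope * year

forecast_2014 = predict(2014)  # → 142.5
```

Serious forecasting adds uncertainty estimates, more predictors, and model validation, but the principle – fit a model to past data, then extrapolate – is the same.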
This area involves technical expertise in statistical methods and mathematical modelling. It also requires business domain knowledge in order to know the right questions to ask when exploring data. An information technology analyst may have a hard time thinking of questions that involve operational financial data, for example, no matter how good the data is. This is why analysts often partner with end users to identify their business questions, which helps them explore the available data, identify trends, and possibly develop predictive models from it.
The third area, data visualization, is above all else about communication. Data needs to be presented to the right audience in order to get their buy-in on some proposal or idea. As such, results from the other two areas need to be communicated in the most effective way to their intended audience. Middle managers and top executives will most likely need to see different visuals of the same underlying data – and knowing the audience is only part of the challenge.
Although it may seem trivial, I often find this area to be just as important as the other two. This is the step where chart makers must convince their audience of the data they present. It is the area most visible to the audience, and as such, if used effectively, it has the most powerful impact in winning their buy-in.
This area requires not-so-simple skills such as knowing which chart to use, choosing the colours and fonts of those charts, and the art of storytelling. It involves an understanding of the psychology of visual perception and human behaviour, among others. Practitioners of data visualization need more than the traditional computer science and statistics body of knowledge; they need a touch of art and social skills, skills arguably quite lacking in the traditional business intelligence world. In my opinion, this area is often neglected due to its traditionally non-data-science nature and definitely deserves more attention. A favourite example of mine is Hans Rosling’s TED talk on “the best stats you’ve ever seen”.
The three principal areas of data science presented here are my own theory on the matter. Other models exist out there, and I suggest you explore them as well. In my opinion, the more models you read up on, the better your understanding will be of this ever-evolving world of data science.