For every data analysis project I work on, I use a structured methodology to ensure a systematic path to insight rather than stumbling my way to it. In addition, for my day-to-day work I’ve adapted the Cookie Cutter Data Science repo to fit this methodology. Marrying a structured methodology to a structured project folder with pre-made templates has led to four major benefits.
First, it’s increased the speed with which I complete projects. Knowing the order of a project and the questions that prompt insight in a predictable way allows more room for creative thinking. Instead of being mired in the details and minutiae that come with analysis work, more capacity is left for creativity.
Second, it allows others to quickly jump in and see how you came to your answers, or to contribute. This is even easier if they’ve adopted the same methodology and project template structure.
Third, much code is redundant from project to project, and it’s a waste of time to manually build it for every project. I’ve customized the Cookie Cutter repo to fit within this framework. Check it out here. I’ve adapted this repo to the needs of the work I do, but you could easily customize it to make it work for you. This could mean adding new folders, taking some away, or converting the notebooks to R.
Fourth, using this approach makes jumping back into a project months later very easy. If you’re like me, you might fear going back and trying to reproduce the results of old projects. This approach makes doing so much easier and less panic-inducing.
Six Steps in the Methodology
The following six steps are how I work through an analysis project. The process won’t be perfectly linear; there will be bouncing forward and backward as I progress through the steps.
1. Questions, Significance, and Understanding
The first step is to define what we are looking to answer. In addition, we need to ascertain whether it merits analysis and how this will bring value to the business. This is an important part of the process because there is a near-endless amount of data to derive insight from, and diligence should be exercised to maximize value to the business.
- What business problem are we looking to solve?
- How will this project bring value to the business?
- Can data analysis and/or science contribute to solving this problem?
Often there will be plenty of brainstorming to get to the root of what insights need to be gleaned. For example, let’s say you work for a subscription SaaS company and are tasked with helping reduce churn. What helps me is first framing the problem in terms of how solving it aligns with a mission or objective of the business. For example:
Business Problem: Recurring revenue is the lifeblood of a SaaS business model, and account termination (churn) can have a devastating downstream financial impact.
Proposed Solution 1: Each month, determine which accounts are at risk of leaving. We can do this by building a machine learning classification model.
Proposed Solution 2: We don’t want to just ‘save’ customers at the brink of leaving by identifying them early; we want to solve the fundamental reasons why they are leaving. The proposed solution is to also uncover three changes we can make that will solve these fundamental problems.
Having tied the project back to a business problem, we can now create specific questions to guide us throughout the project. For example, this question would be helpful for the first proposed solution:
What data features do we think are correlated with a customer who has churned?
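To make a question like this concrete, here’s a minimal sketch of ranking features by their correlation with a churn flag. The data and column names (`monthly_logins`, `support_tickets`, `tenure_months`) are entirely hypothetical, invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical account-level data; columns and distributions are illustrative only.
rng = np.random.default_rng(42)
accounts = pd.DataFrame({
    "monthly_logins": rng.poisson(12, 500),
    "support_tickets": rng.poisson(2, 500),
    "tenure_months": rng.integers(1, 60, 500),
})
# Synthetic churn flag loosely tied to low engagement.
accounts["churned"] = (accounts["monthly_logins"] < 8).astype(int)

# Rank features by the strength of their correlation with the churn flag.
corr = accounts.corr(numeric_only=True)["churned"].drop("churned")
print(corr.abs().sort_values(ascending=False))
```

In a real project the correlations alone don’t prove causation, but they suggest which features merit a closer look in the exploration step.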
2. Gather and Wrangle Data
Once the project passes the proposal stage and we know the problem, solution, and questions, we can start on the analysis portion. The first major step is to gather, clean, munge, and organize the data needed for analysis. This process can easily take up 90% of the total time for the project.
- Do we have data that will answer this question?
- Where did we get the data?
- How was the data sampled and is it representative without bias?
- How was the data obtained?
- Do we have access to the data?
- If we are performing supervised modeling is the target variable well defined?
- Have we cleaned the data and achieved data integrity before we begin to explore, visualize, and model the data?
- How will we combine the data if it’s coming from different sources?
- How will we tidy the data to get it ready for step 3?
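As a sketch of the combining and tidying questions above, here’s how two hypothetical sources (a CRM export and a billing export, both invented for illustration) might be joined and audited with pandas:

```python
import pandas as pd

# Two hypothetical sources: a CRM export and a billing export.
crm = pd.DataFrame({
    "account_id": [1, 2, 3],
    "plan": ["basic", "pro", "pro"],
})
billing = pd.DataFrame({
    "account_id": [1, 2, 4],
    "mrr": [49.0, 99.0, 99.0],
})

# Combine sources on a shared key; `indicator=True` flags unmatched rows
# so we can audit join quality before analysis.
combined = crm.merge(billing, on="account_id", how="outer", indicator=True)
print(combined["_merge"].value_counts())

# A basic integrity check before moving on to exploration.
assert combined["account_id"].is_unique
```

The `_merge` counts surface accounts that appear in only one system, which is exactly the kind of data-integrity question this step is meant to answer.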
3. Exploration, Analysis, and Visualization
Depending on the scope of the project, this may be one of the last steps if the task is simple. A data science or analysis project can generally be categorized into two types:
- Predictive: Are we looking at historical data to predict what will happen, or to categorize it?
- Descriptive: Are we looking to understand the basic features of the data?
This step is the descriptive category. The goal is simply to understand the data. If your project entails finding simple summary statistics or trends, then we’ll finish that here.
- View data distributions.
- Identify skewed predictors.
- Identify outliers.
- Calculate summary statistics.
- Find trends.
- Uncover interesting insights.
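A minimal sketch of a few items from this checklist, using a synthetic, right-skewed usage metric (the variable name `api_calls` is invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical usage metric with a long right tail.
usage = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=1000), name="api_calls")

# Summary statistics and distribution shape.
print(usage.describe())
print("skew:", usage.skew())  # > 0 indicates a right-skewed predictor

# Flag outliers with the interquartile-range rule.
q1, q3 = usage.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = usage[(usage < q1 - 1.5 * iqr) | (usage > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outliers out of {len(usage)}")
```

Skewed predictors and outliers found here often feed directly into modeling decisions later, such as log-transforming a feature.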
Another popular mantra for visual information seeking is:
- Overview: Gain an overview of the entire collection.
- Zoom: Zoom in on items of interest (drilling into the data).
- Filter: Filter out uninteresting items.
- Details-on-Demand: Select an item or group and get details when needed.
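The mantra can be sketched even without charts; here’s a hypothetical pandas version of the four stages, with invented churn-by-region data:

```python
import pandas as pd

# Hypothetical monthly churn counts by region (illustrative data only).
df = pd.DataFrame({
    "region": ["NA", "NA", "EU", "EU", "APAC", "APAC"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan", "Feb"],
    "churned_accounts": [10, 12, 30, 45, 5, 4],
})

# Overview: totals across the whole collection.
print(df.groupby("region")["churned_accounts"].sum())

# Zoom: drill into the region that stands out.
eu = df[df["region"] == "EU"]

# Filter: keep only months above a threshold of interest.
spikes = eu[eu["churned_accounts"] > 40]

# Details-on-demand: full records for the selected items.
print(spikes.to_dict(orient="records"))
```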
If the project involves supervised or unsupervised machine learning, this step is integral to developing and influencing the model design in step 4.
4. Model the Data
If the project is to predict or infer, then we’ll utilize machine learning techniques.
Supervised: This method of modeling can be classified into two main areas. The goal is to take past data to predict or classify future data.
- Classification: A classification problem is when the output variable is a category.
- Regression: A regression problem is when the output variable is a real value.
Unsupervised: This method of modeling is to find hidden structures in the data.
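As an illustration of the supervised classification case, here’s a minimal sketch using scikit-learn on synthetic data; the features and churn labels are invented, and a real project would use the data prepared in step 2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic features and churn labels (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))  # e.g. logins, tickets, tenure
y = (X[:, 0] + 0.5 * rng.normal(size=400) < 0).astype(int)  # churn flag

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Classification: predict a categorical output (churned vs. retained).
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```

Holding out a test set, as above, is what makes the accuracy number meaningful when we interpret results in step 5.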
5. Interpretation & Evaluation
By this point we’ve munged the data, visualized it, explored it, and modeled it. Now it’s time to take stock of what happened and decide what to do with what was learned.
- Do the results make sense?
- How would you communicate the results to others?
- What was learned?
- If we are going to continue this work, what next steps would you take with this project?
- Has someone with domain-knowledge validated the findings?
- For clustering, how will the clusters be translated into something useful for the business?
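One simple sanity check for “do the results make sense” is a confusion matrix of predictions against ground truth. The numbers below are invented for illustration:

```python
import pandas as pd

# Hypothetical predictions vs. ground truth from the modeling step.
results = pd.DataFrame({
    "actual":    [1, 0, 1, 1, 0, 0, 1, 0],
    "predicted": [1, 0, 0, 1, 0, 1, 1, 0],
})

# A confusion matrix is a quick sanity check before communicating
# findings to stakeholders.
confusion = pd.crosstab(results["actual"], results["predicted"])
print(confusion)

accuracy = (results["actual"] == results["predicted"]).mean()
print(f"accuracy: {accuracy:.2f}")  # surprises here warrant domain review
```

Breaking errors out by type (false positives vs. false negatives) also makes the domain-knowledge validation above far more concrete than a single accuracy number.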
Communicating results is an integral and often overlooked aspect of analysis projects. One of the core tenets of this job function is to derive insight from data so that stakeholders can make business decisions. The scope of the project will generally dictate how to communicate the analysis. Generally, it will fall into one of these categories:
- Slack message
- Raw data