Estimating Data Science Projects
Data science projects are notoriously difficult to estimate because they do not fit into the traditional software engineering project management paradigm. People without a background in the data field don’t immediately understand the differences, so they don’t see the issue. “It just doesn’t work that way” isn’t an answer the business accepts.
I spent the first 3 years of my data science career working to solve the estimation problem, and I am still improving my approach. This process is by no means perfect, but it works far better than any other framework I have used or seen implemented.
Software engineering has frameworks like CMM (Capability Maturity Model). I have seen Six Sigma and Lean methods applied to software engineering and quality processes. We have agile, SCRUM, Kanban, and good old waterfall. Most technology organizations use some hybrid of two or more approaches. They want the data science team to follow what they have in place, which is where the problems start.
The process dictates the estimation methods, which points to the first problem most data science teams have with estimation. The team has no standard, repeatable process. There is no way to provide consistent estimates until the team has implemented one.
Many data science teams adopt agile or SCRUM because it’s what the software engineering team uses. It works for some parts but not others, which has an odd side effect. The data science team starts to focus on the activities agile (or whatever) can handle, resulting in high-quality code but low-quality models.
What’s going on? Anything that cannot be estimated cannot be managed or planned around. Leadership discourages those types of activities because they don’t deliver on time. Many data science teams’ processes are some version of exploring the available data, training a few different models, selecting the best, and shipping it to production. That’s software engineering with PyTorch, where we pull the logic from datasets instead of coding the logic. The result has the same level of functionality and value.
Splitting Up And Defining A Data Science Process
I split data science into four lifecycles: data, research, model development, and maintenance. Business maturity and capabilities dictate how much of each lifecycle is filled in. Step one in estimation is to document the existing data science process and get everyone to follow it.
If you don’t have a solid starting point, Microsoft’s Team Data Science Process (TDSP) is one of the best templates to start with. Amazon has an exceptionally detailed lifecycle too. Here’s an excellent write-up from Neptune AI about several different frameworks and how well they work for data science.
Microsoft’s and Amazon’s frameworks are comprehensive enough to cover an industry innovator’s needs. My entire lifecycle is built for Fortune 100 client needs. Those are all end states, and for most businesses, implementing the complete process is 3-5 years out. Long story short: no, you probably don’t need everything yet, but you are building toward implementing everything to support evolving business needs.
The second step is several improvement iterations. Once the existing process is written down and used, the cracks and flaws are quickly revealed. Version one of most processes works for some projects but not all. Estimations based on early data science process versions are still inaccurate, but at least now, the team can fix what’s broken.
After the team gets to a repeatable process, another split needs to happen. Some parts of each lifecycle are continuous efforts. They never end. Data gathering, data discovery, experimentation, and model maintenance are all continuous. They are also not necessarily attached to a single project.
Improving a model in production continues long after the project ends. Data gathering can be part of the model maintenance and improvement process rather than a project-level initiative. One part of estimation is perpetual resources. They are not assigned to a specific project. They support a continuous process, so they are baked into every project, but not in a typical project planning way.
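One way to picture "baked into every project, but not in a typical project planning way" is to treat continuous work as an overhead rate applied to each project's direct estimate instead of estimating never-ending tasks inside any one project. A minimal sketch, with purely hypothetical numbers:

```python
# All figures are hypothetical, for illustration only.
annual_continuous_cost = 400_000   # perpetual work: maintenance, data gathering, discovery
annual_project_budget = 2_000_000  # total direct spend on discrete projects

# Recover continuous-process costs as a flat overhead rate on every project,
# rather than trying to estimate open-ended work inside a single project plan.
overhead_rate = annual_continuous_cost / annual_project_budget  # 0.20

def loaded_estimate(direct_cost: float) -> float:
    """Project estimate with the continuous-process share baked in."""
    return direct_cost * (1 + overhead_rate)

print(loaded_estimate(250_000))  # 300000.0
```

The rate itself would come from the team's actual run-rate on perpetual work; the point is that each project carries a slice without any project owning the whole cost.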
Model development, data pipeline development, and a few other parts of the data science process fit very well into agile, SCRUM, or other traditional software project estimation frameworks. Those phases need to be defined, and they are ready for straightforward estimations.
Other parts of each lifecycle are research processes. These do not fit into standard software project planning and estimation frameworks. I put them into a gated review process. Most of the research lifecycle is iterative and not guaranteed to produce results. That cannot be part of a product roadmap because it breaks the estimation paradigm.
Managing The Research Lifecycle
Research happens before a traditional project begins. It happens before projects are put on the product roadmap, so it isn’t attached to a single project. There’s no other way to manage the uncertainty.
Data science research phases produce artifacts, typically novel models and datasets. Those artifacts can be used across multiple projects and products. Research is a high-value activity, but not all research cycles produce an artifact. The lifecycle is too unreliable to fit into any traditional engineering project estimation framework.
Before research begins, the business needs a way to connect the research with business value. If that isn’t part of the process, the research team produces interesting artifacts looking for business applications. There isn’t much value in that process.
The first phases of the research lifecycle are problem and solution space exploration. Research starts with a business problem. Data scientists evaluate those problems to see if data science might provide a solution. If it’s a good candidate, they can explore the solution space or answer the question, “Can we solve this problem with data science?” The final product of the solution space exploration is a feasibility study for the proposed research phase.
A gated review process fits well here. Is this problem worth having a data scientist evaluate? That’s gate one. If they get the green light, phase one starts. The data scientist comes back with a feasibility study. Is there a viable solution, and is it worth the estimated costs to start a research cycle? That’s gate two. There will be several gates where the business can review progress and decide if they want to fund the next phase or another research cycle.
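The gate sequence above can be sketched as a small state model. Everything here is illustrative (the class names, the decision options, and the sample problem are my own, not a prescribed tool), but it captures the mechanic: each periodic review records a decision, and work continues only until a gate says stop.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    FUND_NEXT_PHASE = "fund next phase"
    ANOTHER_CYCLE = "run another research cycle"
    STOP = "stop the effort"


@dataclass
class Gate:
    question: str
    decision: Optional[Decision] = None


@dataclass
class ResearchTrack:
    problem: str
    gates: List[Gate] = field(default_factory=list)

    def review(self, gate: Gate, decision: Decision) -> bool:
        """Record a periodic review; returns True while work continues."""
        gate.decision = decision
        self.gates.append(gate)
        return decision is not Decision.STOP


# Hypothetical research track walking through the first two gates.
track = ResearchTrack("is churn prediction feasible with our data?")
gate_one = Gate("Is this problem worth having a data scientist evaluate?")
gate_two = Gate("Does the feasibility study justify a funded research cycle?")

assert track.review(gate_one, Decision.FUND_NEXT_PHASE)
assert track.review(gate_two, Decision.ANOTHER_CYCLE)
```

Note there is no end date anywhere in the model: the track just accumulates reviews until one returns a stop decision.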
There is no definite end date, just periodic reviews. The project can be ended at any time or extended repeatedly.
At the end of a successful research phase, artifacts are published, and they are ready for a more traditional project estimation structure. Artifacts (data and/or models) can be leveraged across multiple projects, and the uncertain pieces of the project have been pulled out. Since the artifacts are aligned with a business problem space, they lead to projects aligned with strategic business goals.
The Odd Case Of Infrastructure
Infrastructure projects are another oddity. In my experience, 90% of a business’s data science development automation, model maintenance, internal business automation, data products, and customer product platform needs can be supported by bought tools. Those are expensive. The remaining needs are customizations based on the business, customers, projects, legacy constraints, or data science team. Custom tool development is also expensive.
Both are very difficult to get budget for because they are big line items. The solution is to slip them into projects.
The research solution space definition should include infrastructure to support research capabilities. Traditional project scoping should include automation, data, product, and maintenance infrastructure.
That’s straightforward, but the problem is the ROI calculation. Unless you’re working on mega-projects, it’s difficult for a single project to justify the level of effort required to build the proposed infrastructure. The work itself happens inside a single project, but the costs should be spread across all the projects that will benefit.
With that approach, the ROI calculation works and is obvious. Estimation is part of the budgeting process, and slipping part of the total cost into several projects makes infrastructure an easier sell.
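The arithmetic behind that sell is simple. With hypothetical numbers (a $300k build, $120k of value per project, six projects reusing the infrastructure), charging the full cost to one project yields a negative ROI, while amortizing it turns each project's ROI sharply positive:

```python
# Hypothetical numbers: a $300k infrastructure build evaluated against a
# single project versus amortized across six projects that reuse it.
infra_cost = 300_000
project_benefit = 120_000   # value a single project delivers
projects_sharing = 6        # projects expected to reuse the infrastructure

# Charged entirely to one project, the build never pays for itself.
single_project_roi = (project_benefit - infra_cost) / infra_cost   # -0.6

# Spread across every project that benefits, each carries a small slice.
per_project_share = infra_cost / projects_sharing                  # 50000.0
shared_roi = (project_benefit - per_project_share) / per_project_share
print(f"{shared_roi:.0%}")  # 140%
```

The numbers are invented, but the shape of the calculation is why spreading the line item across several project budgets makes the same infrastructure fundable.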