A Basic Introduction To Research and Study Design For Machine Learning
Full disclosure: I have a hard sciences background. I am not a hard sciences researcher. I have no PhD and do not publish research which adds to the body of knowledge. I am qualified to, and often do, SUPPORT hard sciences research.
I am an Applied Machine Learning Researcher. The “Applied” can be looked at as an asterisk. It means I use research methodologies to publish artifacts with business, real-world, or other tangible applications and utility.
This may feel like a minor distinction but there is a colossal gap between hard sciences research and applied research.
Much of what’s currently available covers experimental execution and management. The design side is covered from a purely Data Science or Hard Sciences perspective. This post connects the dots between the two. I won’t rehash what’s well covered.
The point is to frame scientific design patterns in terms of what Data Scientists do. We use those design patterns daily without explicitly understanding them. Being intentional and aware of our design pattern decisions results in a more mature process and more reliable work products.
The Purpose of Research and Studies
Simply put, research and studies answer a Data Science question. Selecting an appropriate pattern is based on the following feasibility criteria (sketched as a simple checklist after the list):
Type of Question
Availability of Data
Access to the System Under Measurement
Reliability Requirements
Expected Returns (Revenue or Cost Savings)
Experimental Costs
Experiment Duration
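As a rough illustration, these criteria can be captured as a structured checklist that travels with the project. This is a minimal sketch in Python; the field names, types, and the go/no-go rule are my own illustrative choices, not a standard template.

```python
from dataclasses import dataclass

@dataclass
class FeasibilityAssessment:
    """Illustrative checklist for the feasibility phase of the Research Lifecycle.
    Field names and the go/no-go rule below are hypothetical, not a standard."""
    question_type: str            # e.g. "causal", "predictive", "descriptive"
    data_available: bool          # do we have (or can we gather) the data?
    system_access: bool           # can we experiment on the system under measurement?
    reliability_requirement: str  # e.g. "low", "medium", "high"
    expected_return: float        # projected revenue or cost savings
    experimental_cost: float      # projected cost to run the study
    duration_weeks: int           # expected experiment duration

    def is_feasible(self) -> bool:
        # A naive go/no-go rule: the data must exist and the expected
        # return must at least cover the cost of running the experiment.
        return self.data_available and self.expected_return > self.experimental_cost
```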
When I talk about the feasibility study in the Research Lifecycle, this is a high level framework for that phase.
Unlike hard sciences, Applied Machine Learning Research can take shortcuts using lower quality design patterns based on reliability requirements. We build risk mitigation frameworks to compensate for the flaws and weaknesses in our selected design patterns.
Research creates a structured, rigorous framework for creating and gathering data about variables of interest. The objective is to understand the relationship or lack of relationship between those variables. Typically, we look at a treatment of some sort and its impact on an outcome.
In simplistic system terms, we control the starting state and limit changes to a specific treatment. The end state of the system or outcome is measured, and the experiment creates a process for connecting treatment to outcome. (Please note, I am using treatment loosely. Treatment, exposure, and intervention all have more rigorous definitions in the hard sciences research lexicon.)
In our raw datasets, there are almost always multiple treatments in play. The design pattern is a way for us to gather data in the cleanest way possible, eliminating as much noise as is feasible. Examining the relationship between a feature and outcome, without noise, is complex but not impossible.
In our model training, the connection between features and inference is unproven. The learned function is a transformation of the data, not necessarily a reflection of the system under measurement. The design pattern creates the most rigorous framework feasible to measure the effects of a treatment on an outcome.
During the results review, methods like counterfactual analysis can be used to assess experimental validity. A successful experiment, even well designed, may not be a valid method to assess the relationship between features and outcomes.
In Data Science, here’s where we often take shortcuts. There may be other explanations for our results, but we can publish artifacts even in the presence of counterfactuals. We publish artifacts once they meet reliability requirements using a combination of evidentiary support and risk mitigation frameworks.
Our research methodology is complete but does not always meet the standards of hard science research. For this reason, I often say, “Machine Learning does not work. It functions.” Working requires a higher level of scientific rigor than is usually feasible.
Observational Versus Experimental (Statistical Experiments Versus Scientific Experiments)
Observational studies and research are focused on gathered data. There is no formal experiment conducted. We don’t do anything except use a model to observe the data and propose relationships between features and inference.
You will hear me say, “Your trained model is a hypothesis.” This is what I mean. Models are exceptional hypothesis generators. The strength of the model and the quality of the training data provide only weak support for the hypothesis.
Most modeling results in a descriptive or non-analytical model. These models learn a function based on the data, not based on an attempt to test a hypothesis in a rigorous way. In hard sciences research, non-analytical observational studies build support for a hypothesis being worth testing. It can be used to justify performing an experiment.
Analytical studies and research test a hypothesis with the objective of supporting the relationship between features and outcomes. Analytical studies create evidentiary support for a non-analytical model. Analytical studies and research can be observational. This is the primary direction of Data Science experiments using a more rigorous design pattern.
When an experiment is conducted and data is CREATED by an experimental pattern, we move from the hard sciences definition of observational to experimental. I differentiate the two from a Machine Learning perspective using statistical and scientific experiments. We use statistics to examine data created by uncontrolled processes. (That process should be intentionally selected based on a design pattern.) We use science to create data using controlled processes dictated by the design pattern.
When I talk about research creating novel datasets, I am referring to data created by experimental design patterns. Data Scientists rarely have access, and it is rarely cost effective even when we do, to conduct experiments directly on the system under measurement.
We usually get tangential datasets or can do tangential experiments to generate them. Think of early extrasolar planetary discovery. Scientists did not have the capabilities to directly observe planets around other stars. They found stars wobbling in their datasets. They proposed a hypothesis: stars wobbled because of the gravitational impacts of planets.
They supported their hypothesis with descriptive observational data. Alternative explanations were proposed. Analytical observational experiments could then be conducted. Alternatives were refuted until the hypothesis had sufficient evidentiary support to be accepted, but not rigorously proven.
This is what Data Scientists do. The number of alternatives explored depends on the reliability requirements, risk tolerance, and our risk mitigation frameworks. We rarely deploy a non-analytical model but also rarely explore all alternative models.
One of the reasons automation is so critical is the ability to explore and compare alternative models. Without automation, it is exceptionally time consuming and expensive, to the point of being infeasible, to do a comprehensive study of alternatives.
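To make that concrete, here is a minimal sketch of automated comparison of alternative models using scikit-learn on a synthetic dataset. The candidate list, metric, and cross-validation settings are illustrative assumptions, not recommendations; in practice the loop would run over the project’s real data and candidate set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Stand-in dataset; in practice this would be the project's training data.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Candidate alternative models. The point is the automated loop, not this list.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1_000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Automated comparison of alternatives via cross-validation.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Without this kind of automation, each alternative would be a manual, one-off effort, which is why comprehensive studies of alternatives so rarely happen.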
For science, the risk of being wrong drives further experimentation until they get it right. For engineering and the applied sciences, we only need to be right within certain tolerances, which are defined by the business’s reliability requirements.
Research and Study Design Patterns
Action Research
Before and After Study
Case Control Study
Case Reports/Study/Series
Causal
Cluster Randomized Trial
Cohort
Cost Benefit/Effectiveness Study
Cross Sectional Study
Descriptive
Diagnostic/Validity/Reliability
Experimental
Exploratory
Historical Design
Longitudinal Design
Magnitude of Effect
Meta Analysis
Non-Controlled
Non-Randomized Controlled
Observational
Philosophical
Practice Guide
Randomized Controlled Trial
Randomized Crossover Trial
Sequential
Systematic Review
Time Series
Trend
A quick Google search for each pattern will return a detailed explanation for further learning. The gold standard in my opinion is the medical field. Broadly, the biological sciences provide the best definitions and frameworks for reliable research.
Physics provides the best case studies for research conducted without direct access to the system under measurement. You will find several now-confirmed theories which were initially built using observational design patterns and mathematical proofs. These create a conceptual blueprint to frame many of our experiments.
You also see the level of creativity involved in experimental design. When the easiest designs are removed, we must come up with experiments which are feasible and reliable but not obvious.
At the beginning of every project, intentionally select your design pattern. Choose the highest reliability pattern which is feasible. Write up the risks in advance by describing the weaknesses of the design pattern. You are defining the irremovable risk for this project based on your chosen design pattern. Create high level risk mitigation frameworks to address those risks.
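One way to capture that write-up is a small, structured record kept alongside the project. This is a minimal sketch; the field names and the example values are hypothetical, not a prescribed template.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DesignPatternSelection:
    """Illustrative record of a design pattern decision and its known weaknesses.
    The structure, field names, and example below are hypothetical."""
    project: str
    pattern: str                         # e.g. "Cohort", "Cross Sectional Study"
    rationale: str                       # why this is the highest-reliability feasible pattern
    irremovable_risks: List[str] = field(default_factory=list)      # weaknesses inherent to the pattern
    mitigation_frameworks: List[str] = field(default_factory=list)  # high-level mitigations for those risks

# Hypothetical example for stakeholder review.
churn_study = DesignPatternSelection(
    project="Customer churn model",
    pattern="Cohort",
    rationale="No access to run a randomized trial on live customers; historical cohorts are available.",
    irremovable_risks=["Selection bias in how cohorts were formed", "Unmeasured confounders"],
    mitigation_frameworks=["Compare against a holdout cohort", "Monitor live performance for drift"],
)
```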
Review this with stakeholders. Does this meet reliability tolerances? Some projects die here. The best possible method may not be capable of producing a model which functions within tolerances. Most projects are feasible and can move forward with a high level of confidence that they will result in an artifact with business value.
Using the design pattern, design the experiment. The pattern you select and available data typically make the experimental design details obvious. You have a lot of latitude in experimental design so document each choice you make for review.
Each choice can, or more accurately will, introduce bias. Bias is not always a deal breaker. Have someone else review your experimental design for validity. Ask them to check your biases and think about ways your choices could invalidate the experiment.
Think about the fundamentals of your experiment. In the best case scenario, will the results actually support the hypothesis or generate a quality hypothesis in the first place? Using a design pattern does not guarantee a valid experiment with respect to your desired outcomes.
Hierarchy of Studies (Reliability and Evidentiary Support Quality, Risk Discovery)
Tier 1
Cluster Randomized Trial
Randomized Controlled Trial
Randomized Crossover Trial
Tier 2
Cohort
Tier 3
Case Control
Diagnostic/Validity/Reliability
Non-Randomized Controlled Trial
Non-Randomized Crossover Trial
Time Series
Tier 4
Before and After
Case Reports/Study/Series
Cross Sectional Study
Non-Controlled Trial
Trend
Everything else can be considered scientifically unreliable. However, from an Applied Machine Learning Research perspective, they still create a level of evidentiary support. Again, we see shortcuts available to our field which are unacceptable in the hard sciences.
What we do makes scientists cringe while making businesses several million dollars. The model will fail, but those failures fall within acceptable tolerances. No one builds a bridge to survive for 2000 years. Regular inspection and maintenance is our risk mitigation framework for bridge longevity. The same principles of defining tolerances, selecting methods to meet those tolerances, and implementing rational risk mitigation frameworks apply to Data Science.
We also reveal the weakness behind most model development methodologies. Models are descriptive of the data, not of the relationship between features and outcomes. They should not be relied upon for predictive or prescriptive value until evidentiary support equals or exceeds reliability requirements.
Most basic Machine Learning modeling methodologies produce little more than glorified curve fitting, but theater with business value is still a viable work product. If we advertise the risks in a thorough manner and the business is willing to accept those risks, the model can be deployed.
This is the difference in maturity. There must be some exploration of risk. Those risks must be publicized and reviewed by the business. Those risks must have monitoring and mitigation frameworks in production. The Data Science team must work to continually reduce risk where it adds value.
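At its simplest, a production monitoring and mitigation framework can look like a tolerance check on live performance. This is a hedged sketch only; the metric, the threshold, and the alerting behavior are placeholders to be replaced with whatever the business actually agreed to.

```python
# Illustrative reliability tolerance agreed with the business (placeholder value).
ACCURACY_TOLERANCE = 0.85

def check_model_health(recent_accuracy: float) -> None:
    """Compare recent live performance against the agreed tolerance and
    flag when the model drifts out of bounds."""
    if recent_accuracy < ACCURACY_TOLERANCE:
        # Mitigation framework kicks in here: alert the team, fall back to a
        # simpler model, or trigger retraining -- whatever was agreed upon.
        print(f"ALERT: accuracy {recent_accuracy:.3f} below tolerance {ACCURACY_TOLERANCE}")
    else:
        print(f"OK: accuracy {recent_accuracy:.3f} within tolerance")

# Example call with a hypothetical measurement from production.
check_model_health(0.81)
```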
Conducting Data Science experiments is a well-covered topic. I won’t rehash what’s been done exceptionally by others unless you really want me to.