Experiment Design Guidelines for Product Analysts - Part 1/3

By Elisabeth Reitmayr

At ResearchGate, we run a lot of experiments to improve our product for our users. Our experiment design guidelines for product analysts give guidance on how to set up those experiments from an analytical and statistical perspective to ensure we can evaluate each experiment as intended. This guideline gives some hints but does not fully cover the product management, user research, and design perspectives, i.e., what to experiment on. In this post, we will focus on the work that is required before starting an experiment.

This post is the first part of a series in which we publish some of our internal guidelines and frameworks to make the way we work more transparent. We are interested in your feedback on these guidelines — please send it to [email protected]

Objectives of the experiment

Is an experiment the best method?

Experiments are a very powerful tool in the methodological repertoire of a product analyst because they allow us to infer a causal link between a treatment (product change) and an effect. This is much stronger evidence than, for example, correlation analysis, which does not allow us to draw causal conclusions. So why don't we just run experiments for everything? Experiments are expensive: they require a lot of preparation and monitoring from product management, user research, product analytics, and design teams, and eventually engineering time for implementation and resolution. They also come with opportunity cost: we only have a limited amount of traffic and time to experiment, and we should make sure we use it for the most impactful changes and innovations. Therefore, we should choose the assumptions and hypotheses we experiment on carefully, based on previous insights.

As suggested in this blog post, we should only test assumptions that have the potential to provide high user value and that carry high risk. As we want to minimize the uncertainty of the most impactful assumptions that our experimental hypotheses are based on, we rely on the concept of the "Riskiest Assumption Test" (RAT; read more on this concept here). The idea behind the RAT is to test the assumptions that can potentially have a strong effect on the product (high risk). "Risk" can be defined in terms of the potential effect on user behavior, or in terms of our uncertainty about whether the assumption is valid. If we rely on an assumption for which we have not gathered any previous insights, the uncertainty is high.

Whether an experiment is the best method to test the assumption depends on various factors such as:

  • What is the cost of the experiment?

  • Do we have enough traffic to evaluate the experiment quickly?

  • What is the chance we end up implementing the tested solution?

We add a limitation to our interpretation of “riskiest” in the RAT concept: if the solution we are testing is associated with very high risk, there is also a higher chance that we end up not implementing it. Therefore, a usability test with mockups might be a better (cheaper) first step to test the underlying assumptions before running an experiment.

We run experiments to learn about our users

We run experiments to improve our product in a way that serves our users’ needs better. Therefore, we have to make sure that we have a solid understanding of our users’ needs in the specific domain we are experimenting on. For example, if we want to support our users in discovering relevant content in our product, we should have a good understanding of the different tasks that users are trying to accomplish with our product before we run experiments.

Each experiment should be set up in a way that enables us to learn about our users. We can often transfer learnings from one context to another. That's why we want to make sure we test the assumptions about our users in the most direct way possible, so that we can update our theories about our users with the new insights we generate via the experiment. For example, in most cases we should not test two changes at the same time (unless we use a full-factorial design; read more in the next part of this blog post) because we will not be able to attribute the result of the experiment to the individual changes we introduced. We should also aim to test assumptions about our users' needs (e.g., "People don’t want to click like on a story if they dislike the title") rather than testing specific solutions (e.g., "Users will click more on stories if we introduce a dislike button") (read more here).
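To illustrate why a full-factorial design keeps two simultaneous changes attributable, here is a minimal sketch in Python. The factor names, the experiment name, and the hash-based assignment are hypothetical choices made for this illustration and not a description of ResearchGate's setup. The idea is that every combination of levels becomes its own cell, so the effect of each change can be estimated separately.

```python
import hashlib
from itertools import product

# Hypothetical factors: two changes we would like to test at the same time.
FACTORS = {
    "button_position": ["control", "top"],
    "button_label": ["control", "save"],
}

# Full-factorial design: one cell per combination of levels (2 x 2 = 4 cells).
CELLS = [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]


def assign_cell(user_id: str, experiment: str = "bookmark_factorial") -> dict:
    """Deterministically map a user to one of the cells.

    Hashing the experiment name together with the user id gives a stable,
    roughly uniform assignment (an assumption of this sketch).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]


if __name__ == "__main__":
    for cell in CELLS:
        print(cell)
    print(assign_cell("user-123"))
```

Because each user sees exactly one combination, comparing cells that differ in only one factor isolates that factor's effect. Note that the available traffic is split across four cells instead of two.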

Work that needs to be done before implementing the experiment

Designing an experiment properly requires a lot of work upfront — before writing any code. The first step for designing an experiment is defining the follow-up action you take in case you gather the evidence you are interested in:
"Statistics is the science of changing your mind under uncertainty, so the first order of business is to figure out what you’re going to do unless the data talk you out of it ... That’s why everything begins with a physical action/decision that you commit to doing if you don’t gather any (more) evidence." (Never start with a hypothesis)

Defining such a follow-up action often requires user research to make sure we actually address our users’ needs and do not only experiment towards moving a certain metric. We should have a clear understanding of the user journey we are working on, and define a clear hypothesis based on assumptions. In this context, a hypothesis does not refer to the null hypothesis we define for the experiment (we call this the "statistical hypothesis"); here we are talking about the hypothesis about our users. A hypothesis usually has the following format:
"We believe that <assumption for a certain type of user>, and if we provide <feature/change> for them, they will <behaviour/metric change>"

The follow-up action should be defined based on a quantified expectation. This means that we do not only say “we expect a lift in conversion rate” but rather “we expect at least a 5% lift in conversion rate”. This helps to prevent the implementation of marginal improvements and is also important for determining the required sample size (the “minimum detectable effect”; read more in the third part of this blog post).
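To make the link between a quantified expectation and the required sample size concrete, here is a minimal sketch in Python. It uses the standard normal-approximation formula for comparing two proportions. The baseline rate, the reading of "5% lift" as a relative lift, the significance level, the power, and the daily traffic figure are all assumptions chosen for illustration, not values from our guidelines.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)  # expected rate under the treatment
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def days_to_run(n_per_variant: int, daily_users: int, n_variants: int = 2) -> int:
    """Rough duration, assuming all eligible daily users enter the experiment."""
    return math.ceil(n_per_variant * n_variants / daily_users)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, at least a 5% relative lift.
    n = sample_size_per_variant(baseline_rate=0.10, relative_lift=0.05)
    print("users per variant:", n)
    print("days at 20,000 users/day:", days_to_run(n, daily_users=20_000))
```

The smaller the lift we commit to detecting, the more users we need, which is one reason to agree on the minimum lift before the experiment starts.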

The following summarizes these requirements based on an example experiment to improve the usability of the bookmarking option on the ResearchGate feed:

In the bookmarking feature example, we can learn whether it is merely the visibility of the bookmarking feature that prevents users from using it. If this is not the case, though, we will have to do more research or run more experiments to identify the reasons why users are not adopting it. We might, for example, have cluttered the feed with too many interaction options, such that users feel overwhelmed. In this case, we can test in a different direction to make the feed cleaner. Bookmarking might also feel like a burden to the user because the feed provides endless scrolling and potentially serves users too much content. In this case, we might want to experiment in a bolder direction, like making fewer but more relevant recommendations to users before working on the visibility details of the bookmarking feature.

Template for experiment setup

As a product analyst, you will save a lot of time if you ensure clarity about all prerequisites for experiment analysis upfront. We strongly recommend writing down the background/context section of the experiment documentation and gathering feedback from the design, PM, and user research teams before the engineers implement the experiment. We recommend using this template.

The next part of this blog post will focus on the setup of an experiment.