Experiment Design Guidelines for Product Analysts - Part 2/3

By Elisabeth Reitmayr

This is the second article of a three-part series that aims to add clarity and transparency around the way we work at ResearchGate. To read the first part of this blog post, please click here.

At ResearchGate, we run a lot of experiments to improve our product for our users. Our experiment design guidelines for product analysts establish the process for setting up those experiments from the analytical and statistical perspective, so that we can evaluate each experiment as intended. These guidelines give some hints about, but do not fully cover, the product management, user research and design perspective, i.e. what to experiment on. In the second part of this series, we focus on the setup of an experiment.

We are interested in your thoughts on these guidelines. Please send any feedback to [email protected]

Possible comparisons

A/B/n test

An A/B test (or A/B/C/… if you are testing more than two variants) is the most typical setup: you compare one or more versions of a feature against a control group. (The control is the baseline against which you want to test the changes.) Here we introduce only one change per variant. Referring back to the example from the first post, we change only the color of the bookmark button on the feed, not its position or anything else.

Multivariate test/full-factorial design

Sometimes we want to test the effect of more than one change in our product on user behavior, as well as the interaction effect of those changes. For example, we want to test a new color for the bookmark button on the feed, and also a new position of this button. Here we need to make sure that we are able to attribute the effect of the experiment on user behavior to the different changes we are making by using a full-factorial design. This means that we test all possible combinations of changes. Otherwise, we will not be able to disentangle the effects of the different changes during the analysis.

Example: Changes to the bookmark button on the feed

  • Variant A: Old position and old color (control group)

  • Variant B: Old position and new color

  • Variant C: New position and old color

  • Variant D: New position and new color

If we only tested Variant A against Variant D in this example, we would not be able to learn how much of the effect is attributable to the position change vs. the color change.
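To make the full-factorial idea concrete, here is a minimal Python sketch. It enumerates all combinations of the two changes from the example and shows one simple way to split an observed result into the effect of each change plus their interaction. The function names (full_factorial, decompose_effects) and the metric values are hypothetical and only serve as an illustration; a real analysis would also need a significance test on each effect.

```python
from itertools import product

# Factors and levels taken from the bookmark-button example.
FACTORS = {
    "position": ["old", "new"],
    "color": ["old", "new"],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def decompose_effects(mean_by_variant):
    """Split a 2x2 result into two single-change effects and an interaction.

    mean_by_variant maps (position, color) to the mean of the goal metric.
    All effects are measured relative to the control (old position, old color).
    """
    control = mean_by_variant[("old", "old")]
    color_effect = mean_by_variant[("old", "new")] - control
    position_effect = mean_by_variant[("new", "old")] - control
    combined = mean_by_variant[("new", "new")] - control
    return {
        "color": color_effect,
        "position": position_effect,
        # What the combination adds beyond the sum of the single changes.
        "interaction": combined - color_effect - position_effect,
    }


for label, variant in zip("ABCD", full_factorial(FACTORS)):
    print(f"Variant {label}: {variant}")

# Hypothetical means of "items bookmarked per feed session" per variant.
example_means = {
    ("old", "old"): 1.0,  # Variant A (control)
    ("old", "new"): 1.2,  # Variant B
    ("new", "old"): 1.1,  # Variant C
    ("new", "new"): 1.5,  # Variant D
}
print(decompose_effects(example_means))
```

With only Variants A and D, the sketch could compute the combined difference but none of the three components, which is exactly the attribution problem described above.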

Before-after test

A before-after experiment compares a measurement before introducing a change to the measurement after introducing the change. This means that there is no control group, which makes it hard to infer that the change we are investigating is actually the cause of the change in measurement that we observe after the treatment. For example, there may be seasonal effects at play that actually cause the effect we are observing rather than the treatment.

In most cases this kind of experimental setup is difficult for product experiments because you have to account for seasonality effects. This is possible if you have enough historical data to understand the seasonal patterns well, but in most cases it is easier to implement an A/B test. Examples of before-after experiments are the introduction of our mobile app (where we compare users' behavior before and after introducing the app), or the effect of "natural experiments" (e.g. changes due to COVID-19) on our users' behavior. Those are exceptions, though, as we set up most experiments as A/B/n tests.

Goal metric of the experiment

What is a goal metric?

The goal metric is the central metric of our experiment that makes the underlying product hypothesis (see first blog post) measurable. The goal metric is a part of both your product hypothesis and your statistical hypothesis.

Hypothesis: Making the bookmarking interaction more visible in the home feed will lead to more items bookmarked per feed session. (A feed session is defined as a session in which the user visits the feed.)

Null hypothesis (“statistical hypothesis”): There is no difference in the items bookmarked per feed session between the variant that makes the bookmarking interaction more visible and the control variant (tested at a significance level of alpha = 5%).
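As an illustration of how such a null hypothesis could be evaluated, here is a minimal Python sketch of a two-sided test on the difference in means, using a normal approximation and only the standard library. The choice of test, the function name two_sample_z_test and the sample data are assumptions made for this example, not a description of the test we actually run. The approximation is only reasonable for large samples, and per-session counts are often skewed, so treat this as a sketch.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(control, treatment, alpha=0.05):
    """Two-sided test of equal means using a normal approximation.

    control and treatment are lists with one goal-metric value per
    observation, e.g. items bookmarked per feed session.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    std_err = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    z = diff / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "diff": diff,
        "z": z,
        "p_value": p_value,
        "reject_null": p_value < alpha,
    }


# Hypothetical data: items bookmarked per feed session.
control_sessions = [0, 1, 0, 2, 1, 0, 0, 1, 3, 0] * 50
treatment_sessions = [1, 1, 0, 2, 2, 0, 1, 1, 3, 0] * 50

result = two_sample_z_test(control_sessions, treatment_sessions)
print(result)
```

If reject_null is True, the observed difference would be unlikely under the null hypothesis at the chosen alpha. Otherwise we fail to reject the null hypothesis, which is not the same as showing that the variants perform equally.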

How do we select the goal metric?

We often set strategic goals for teams, which are derived from a certain capability that we want to improve in our product to better serve our users' needs. For example, in one team, we had the goal to increase the average number of content consumptions per session. We defined this goal to contribute to our strategic goal of increasing content consumption on our platform, which we consider a proxy for how well we help users discover relevant content on ResearchGate. While experimenting on our product, we want to get a better understanding of the drivers of this metric, i.e. which input metrics help us to drive the strategic goal (average number of content consumptions per session) in our team.

Ideally, we would want to have exactly this strategic goal as the goal metric for our experiment to better understand what drives this goal metric. However, experiments usually target a very specific part of the user journey that might only have small effects on the strategic goal. For example, if we change the design of a button on the feed that makes the download option more visible, this might have a considerable effect on the click-through-rate on this button. However, we can only expect a tiny effect on the average number of content consumptions per session from this experiment. Therefore, the goal metric of your experiment here should not be the average number of consumptions per session (it is very unlikely that you would find a meaningful effect here) but rather the click-through-rate on the button we are changing, or potentially the publication consumptions from the feed.
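A rough sample size calculation shows why a tiny expected effect makes a metric a poor goal metric. The sketch below uses the standard normal-approximation formula for comparing two means, n = 2 * (z_(1-alpha/2) + z_power)^2 * sigma^2 / delta^2 per variant. The function name sample_size_per_variant, the default power of 80% and the example numbers are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(std_dev, min_detectable_diff, alpha=0.05, power=0.8):
    """Approximate observations needed per variant for a two-sided test.

    std_dev: standard deviation of the metric (assumed equal in both variants).
    min_detectable_diff: smallest absolute difference in means we want to detect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * std_dev ** 2 / min_detectable_diff ** 2
    return ceil(n)


# Hypothetical numbers: the same metric spread, but a large effect on a
# metric close to the change vs. a tiny effect on the strategic metric.
n_close_metric = sample_size_per_variant(std_dev=1.0, min_detectable_diff=0.1)
n_strategic_metric = sample_size_per_variant(std_dev=1.0, min_detectable_diff=0.01)
print(n_close_metric, n_strategic_metric)
```

Because the required sample size grows with the inverse square of the difference we want to detect, an effect that is ten times smaller needs roughly a hundred times more observations per variant.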

Overall, we should always test on a metric that is closely related to the change that we introduce in the experimental version of the product. The strategic goal can be a secondary subject of evaluation: we can compare it between the experimental variants to build a better understanding of how different input metrics change this strategic metric, or to ensure that we don’t introduce cannibalization effects (e.g. if we highlight the full-text download button on the feed, does this increase publication consumptions at the cost of question consumptions?).

The next part of this blog post will focus on sampling for experiments.