Recommendations as treatments

CausalInference

Recommendations

Joachims et al. reframe recommender systems as policies you can study with causal tools like inverse propensity weighting. My summary.

Published

June 17, 2024

This is a summary of “Recommendations as Treatments” (paper) by Joachims et al.

Image created by Dall-e: “Humorous illustration of Inverse Propensity Weighting. It features a quirky professor explaining the concept to amused students in a classroom, with a seesaw representing the ‘Treated’ and ‘Untreated’ groups.”

The main proposal of the paper is to view recommender systems as policies that decide which interventions to make in order to optimize a desired outcome. This opens the door to applying Causal Inference techniques, such as Inverse Propensity Weighting, to recommender systems.

Offline A/B testing or Off-policy evaluation of recommender systems

The idea is to use historical data to evaluate recommender systems and obtain unbiased estimates without resorting to online A/B tests, which can be slow, expensive, and risky because a bad variant can hurt the customer experience (CX).

If we can evaluate different recommendation policies offline (i.e. work out which policy would have performed best had we used it instead of the policy that logged the data), we can find better policies sooner and innovate faster.

Problem with offline A/B tests

Suppose we have two recommendation policies/models: a) the Logging policy (the current production model) and b) the Target policy (the new policy we would like to test). Since the historical evaluation data was produced by the Logging policy, its distribution follows whatever the Logging policy has learned to output. Evaluating the Target policy on this data will therefore be biased towards whatever the Logging policy favors.

For example:

Consider a simple movie recommender for a video streaming service. Imagine that, at each visit, the recommender selects a single movie to suggest to the users after they log in. Assume that we have collected data from this recommender using a stochastic policy whose distribution is weighted towards more popular movies, such as blockbuster superhero movies. When we use these data offline to evaluate new policies, we may mistakenly conclude that policies that favor superhero movies are better, since these movies are over-represented. This is the so-called “rich-get-richer” effect, wherein things that are already very successful become more so due to their ubiquity. At the same time, we may miss opportunities to provide more personalized recommendations due to insufficient data in certain niches. For instance, consider a user whose favorite genre is Scandinavian thrillers. They might also appreciate superhero movies, so recommending them an installment from the Avengers franchise is a safe bet. Yet that would be suboptimal; they would much prefer Midsommar, a horror film set in Sweden.

Another way to put it is to understand that both the actions and the rewards are biased by the production policy.

1. Actions in the logged data are biased because:

Production Policy’s Influence: The production policy (the system currently in use) determines which actions (e.g., recommendations) are made. If this policy has a preference for certain actions (like recommending popular movies), the data collected will be biased towards these actions.
Skewed Data: This means the observed actions in your data are not a random sample but are influenced by the production policy’s preferences. Therefore, some actions are overrepresented while others are underrepresented or not represented at all.

2. Rewards in the logged data are biased because:

Dependent on Actions: The rewards (e.g., clicks, purchases) are outcomes that result from the actions taken. Since the actions themselves are biased, the observed rewards are also biased. Popular items might receive more interactions simply because they are recommended more often, not necessarily because they are inherently better.
Conditional Bias: The rewards are conditional on the biased actions. For instance, the reward distribution for popular items might appear artificially inflated because these items are seen more frequently due to the production policy’s bias.
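The two sources of bias above can be seen in a small simulation. This is a minimal sketch, not something taken from the paper: the item names, propensities, and click rates are made-up numbers, and `simulate_logs` is a hypothetical helper. It logs interactions under a policy that heavily favors a popular item and then counts impressions and clicks per item.

```python
import random
from collections import Counter

def simulate_logs(n, propensities, click_rates, seed=0):
    """Draw n logged (action, reward) pairs from a stochastic logging policy."""
    rng = random.Random(seed)
    actions = list(propensities)
    weights = [propensities[a] for a in actions]
    logs = []
    for _ in range(n):
        # The logging policy picks what to show...
        a = rng.choices(actions, weights=weights)[0]
        # ...and a reward is only observed for the item that was shown.
        r = 1 if rng.random() < click_rates[a] else 0
        logs.append((a, r))
    return logs

# Hypothetical numbers: the logging policy favors the blockbuster,
# although the niche title has the higher true click rate.
propensities = {"blockbuster": 0.90, "niche_thriller": 0.10}
click_rates = {"blockbuster": 0.30, "niche_thriller": 0.60}

logs = simulate_logs(10_000, propensities, click_rates)
shown = Counter(a for a, _ in logs)
clicks = Counter(a for a, r in logs if r == 1)

for a in propensities:
    print(a, "shown:", shown[a], "clicks:", clicks[a],
          "click rate:", round(clicks[a] / shown[a], 3))
```

Under these assumed numbers the blockbuster collects far more clicks in total, simply because it is shown far more often, even though the niche title is clicked more often per impression.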

Solution using Inverse Propensity Weighting

In short, since the logged rewards are produced by the current production model, we can debias them by weighting each logged reward by the inverse of the production policy’s propensity for the action it took.

Let’s consider an example:

x is the vector of contextual features
a is the action/recommendation by the policy. For the sake of explanation, let’s say we have 3 possible actions: a_A, a_B, a_C
\pi is the production policy. In this example, \pi’s propensity scores for the 3 possible actions are 0.75, 0.15, and 0.10
\pi_{\text{new}} is the new policy. In this example, \pi_{\text{new}}’s propensity scores for the 3 possible actions are 0.30, 0.60, and 0.10
n is the total number of samples
r is the reward from the action
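
With these symbols, the inverse propensity weighting estimate of the new policy’s utility can be written as a single formula. The sample index i and the name \hat{U}_{\text{IPW}} are notation introduced here for illustration:

```latex
\hat{U}_{\text{IPW}}(\pi_{\text{new}}) = \frac{1}{n} \sum_{i=1}^{n} r_i \cdot \frac{\pi_{\text{new}}(a_i \mid x_i)}{\pi(a_i \mid x_i)}
```

For a single logged click (n = 1, r = 1), this reduces to the ratio of the new policy’s propensity to the production policy’s propensity for the logged action.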

Case #1:

\pi recommends a_A, and the user likes the recommendation and clicks on it.

The utility score of \pi will be 1 (correct recommendations / total recommendations = 1/1 = 1). The utility score of \pi_{\text{new}} will be:

\underbrace{0.30}_{\pi_{\text{new}}(a_A)} \cdot \frac{1}{\underbrace{0.75}_{\pi(a_A)}} \cdot \frac{1}{n} = 0.30 \cdot \frac{1}{0.75} \cdot \frac{1}{1} = 0.4

Comparing the utilities, the production policy \pi (utility =1) looks better than the new policy \pi_{\text{new}} (utility =0.4).

Case #2:

Suppose instead that \pi_{\text{new}} assigns a_A its highest propensity, 0.90 (rather than the 0.30 listed above). \pi again recommends a_A, and the user clicks it.

The utility score of \pi will be 1. The utility score of \pi_{\text{new}} will be:

0.90 \cdot \frac{1}{0.75} \cdot \frac{1}{1} = 1.2

Now the new policy \pi_{\text{new}} (utility =1.2) looks better than the production policy \pi (utility =1).

Case #3:

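\pi recommends a_A, but the user doesn’t like it and clicks a_B instead. (This assumes the click on a_B is observed and logged as the rewarded action; with bandit feedback alone, only the outcome of the recommended a_A would be recorded.)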

The utility score of \pi will be 0 (0 correct / 1 total). The utility score of \pi_{\text{new}} will be:

\underbrace{0.60}_{\pi_{\text{new}}(a_B)} \cdot \frac{1}{\underbrace{0.15}_{\pi(a_B)}} \cdot \frac{1}{n} = 0.60 \cdot \frac{1}{0.15} \cdot \frac{1}{1} = 4

Comparing the utilities, the new policy \pi_{\text{new}} (utility =4) looks better than the production policy \pi (utility =0).
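
The arithmetic of the three cases can be reproduced with a few lines of Python. This is a minimal sketch: `ips_utility` is a hypothetical helper written for this summary, and each logged sample is passed as a (reward, new-policy propensity, production-policy propensity) tuple.

```python
def ips_utility(samples):
    """IPW estimate: mean of reward * pi_new(a) / pi(a) over logged samples.

    Each sample is a (reward, pi_new_propensity, pi_propensity) tuple.
    """
    if not samples:
        raise ValueError("need at least one logged sample")
    total = 0.0
    for reward, p_new, p_log in samples:
        if p_log <= 0:
            # The production policy must have had a chance to take the action.
            raise ValueError("logging propensity must be positive")
        total += reward * p_new / p_log
    return total / len(samples)

case_1 = ips_utility([(1, 0.30, 0.75)])  # Case #1: click on a_A
case_2 = ips_utility([(1, 0.90, 0.75)])  # Case #2: pi_new puts 0.90 on a_A
case_3 = ips_utility([(1, 0.60, 0.15)])  # Case #3: click on a_B

print(round(case_1, 2), round(case_2, 2), round(case_3, 2))
```

Up to floating-point rounding, the three values match the 0.4, 1.2, and 4 computed by hand above.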

In summary, the weighting mechanism of IPW helps us understand how a new policy would have performed by explicitly correcting for the biases in the offline evaluation data produced by the existing policy (e.g. the rich-get-richer effect).

By weighting each observed interaction with the inverse propensity of the production policy, we place more importance on samples that the production policy was unlikely to recommend and less importance on samples that it was likely to recommend.

As illustrated in the examples above, this helps us to understand if the new policy performs better than the existing policy.
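
With a single sample the estimate is very noisy, so it helps to see what happens over many logged interactions. The sketch below is a simulation under made-up assumptions: the click rates are hypothetical, and `true_utility` and `evaluate_offline` are helper names invented for this summary. It logs data under \pi, then compares the plain average of the logged rewards with the IPW estimate for \pi_{\text{new}}.

```python
import random

def true_utility(policy, click_rates):
    """Expected click rate if `policy` chose the recommendations itself."""
    return sum(policy[a] * click_rates[a] for a in policy)

def evaluate_offline(n, pi, pi_new, click_rates, seed=0):
    """Log n interactions under pi; return (naive average, IPW estimate)."""
    rng = random.Random(seed)
    actions = list(pi)
    weights = [pi[a] for a in actions]
    naive_sum = 0.0
    ips_sum = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=weights)[0]  # action logged by pi
        r = 1 if rng.random() < click_rates[a] else 0  # observed reward
        naive_sum += r                                 # ignores pi_new entirely
        ips_sum += r * pi_new[a] / pi[a]               # reweighted for pi_new
    return naive_sum / n, ips_sum / n

pi = {"a_A": 0.75, "a_B": 0.15, "a_C": 0.10}
pi_new = {"a_A": 0.30, "a_B": 0.60, "a_C": 0.10}
# Hypothetical true click rates, unknown to the evaluator in practice.
click_rates = {"a_A": 0.20, "a_B": 0.50, "a_C": 0.10}

naive, ips = evaluate_offline(200_000, pi, pi_new, click_rates)
print("true utility of pi_new:", round(true_utility(pi_new, click_rates), 3))
print("naive average of logged rewards:", round(naive, 3))
print("IPW estimate for pi_new:", round(ips, 3))
```

Under these assumptions, the naive average tracks the utility of the production policy that generated the logs, while the IPW estimate should land close to the true utility of \pi_{\text{new}}, without ever deploying it.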

This summary captures the main idea behind using Causal Inference techniques in the context of recommendation engines. I have left out several interesting gems presented in the original paper (e.g. counterfactual risk minimization, fairness of recommendation engines, etc.), so have a look at the paper if you are interested in learning more :).
