Recommendations as treatments

CausalInference
Recommendations
Joachims et al. reframe recommender systems as policies you can study with causal tools like inverse propensity weighting. My summary.
Published

June 17, 2024

This is a summary of the paper “Recommendations as Treatments” by Joachims et al.

Image created by Dall-e: “Humorous illustration of Inverse Propensity Weighting. It features a quirky professor explaining the concept to amused students in a classroom, with a seesaw representing the ‘Treated’ and ‘Untreated’ groups.”

The paper's main proposal is to view recommender systems as policies that decide which interventions to make in order to optimize a desired outcome. This opens the door to applying Causal Inference techniques, such as Inverse Propensity Weighting, to recommender systems.


Offline A/B testing or Off-policy evaluation of recommender systems

The idea is to use historical data to evaluate recommender systems and obtain unbiased estimates without resorting to online A/B tests, which can be slow, expensive, and risky for the customer experience.

If we can evaluate different recommendation policies offline (i.e. work out which policy would have performed best had we used it instead of the policy that logged the data), we can find better policies sooner, i.e. innovate faster.


Problem with offline A/B tests

Suppose we have two recommendation policies/models: a) the Logging policy (the current production model) and b) the Target policy (the new policy we would like to test). Since the historical evaluation data was produced by the Logging policy, its distribution follows whatever the Logging policy has learned to output. Evaluating the Target policy on this data will therefore be biased towards whatever the Logging policy favors.

For example:

Consider a simple movie recommender for a video streaming service. Imagine that, at each visit, the recommender selects a single movie to suggest to the users after they log in. Assume that we have collected data from this recommender using a stochastic policy whose distribution is weighted towards more popular movies, such as blockbuster superhero movies. When we use these data offline to evaluate new policies, we may mistakenly conclude that policies that favor superhero movies are better, since these movies are over-represented. This is the so-called “rich-get-richer” effect, wherein things that are already very successful become more so due to their ubiquity. At the same time, we may miss opportunities to provide more personalized recommendations due to insufficient data in certain niches. For instance, consider a user whose favorite genre is Scandinavian thrillers. They might also appreciate superhero movies, so recommending them an installment from the Avengers franchise is a safe bet. Yet that would be suboptimal; they would much prefer Midsommar, a horror film set in Sweden.
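To make the movie example concrete, here is a minimal sketch of how an uncorrected offline score can rank two policies in the wrong order. All numbers (how often each movie was shown, how often this user likes each movie) and all names in the code are hypothetical, chosen only for illustration; they do not come from the paper.

```python
# Hypothetical numbers, chosen only to illustrate the bias.
logging_policy = {"superhero": 0.9, "thriller": 0.1}  # how often each movie was shown
true_like_rate = {"superhero": 0.5, "thriller": 0.8}  # this user's real preferences


def true_value(policy):
    """Expected reward if `policy` were actually deployed."""
    return sum(policy[a] * true_like_rate[a] for a in policy)


def naive_offline_score(policy):
    """Expected score when logged rewards are credited to `policy`
    without correcting for how often the logging policy showed each movie."""
    return sum(logging_policy[a] * policy[a] * true_like_rate[a] for a in policy)


always_superhero = {"superhero": 1.0, "thriller": 0.0}
always_thriller = {"superhero": 0.0, "thriller": 1.0}

for name, policy in [("always superhero", always_superhero),
                     ("always thriller", always_thriller)]:
    print(name,
          "| naive offline score:", round(naive_offline_score(policy), 2),
          "| true value:", round(true_value(policy), 2))
```

In this toy setup the thriller policy is truly better for this user, yet the uncorrected score prefers the superhero policy simply because superhero recommendations dominate the logs.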

Another way to put it is to understand that both the actions and the rewards are biased by the production policy.

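1. Actions in the logged data are biased because the production policy chose which items to recommend, so the items it favors are over-represented.

2. Rewards in the logged data are biased because feedback is only observed for the items that were actually recommended; we never see how users would have responded to the alternatives.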


Solution using Inverse Propensity Weighting

In short, since the logged rewards are produced by the current production model, we can debias them by weighting each logged reward by the inverse of the probability (the propensity) with which the production model made that recommendation.
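For reference, here is a sketch of the estimator behind this idea, written in the notation of the examples in this post. The symbols a_i (the i-th logged action), r_i (its observed reward), r(a) (the expected reward of action a), and n (the number of logged interactions) are labels assumed here for illustration, and the user context is omitted for brevity.

```latex
\hat{U}_{\text{IPW}}(\pi_{\text{new}})
  = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi_{\text{new}}(a_i)}{\pi(a_i)} \, r_i

\mathbb{E}_{a \sim \pi}\!\left[ \frac{\pi_{\text{new}}(a)}{\pi(a)} \, r(a) \right]
  = \sum_{a} \pi(a) \, \frac{\pi_{\text{new}}(a)}{\pi(a)} \, r(a)
  = \sum_{a} \pi_{\text{new}}(a) \, r(a)
```

The second line shows why the weighting removes the bias: the production propensity \pi(a) cancels, leaving the expected reward under \pi_{\text{new}}. This requires \pi(a) > 0 for every action that \pi_{\text{new}} can choose.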

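Let’s consider an example with a production policy \pi and a new policy \pi_{\text{new}}, each of which assigns a probability (propensity) to every candidate recommendation, such as a_A and a_B. In each case we log a single interaction (n = 1), and a click counts as a reward of 1.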

Case #1:

\pi recommends a_A, and the user likes the recommendation and clicks on it.

The utility score of \pi will be 1 (correct recommendations / total recommendations = 1/1 = 1). The utility score of \pi_{\text{new}} will be:

\underbrace{0.30}_{\pi_{\text{new}}(a_A)} \cdot \frac{1}{\underbrace{0.75}_{\pi(a_A)}} \cdot \frac{1}{n} = 0.30 \cdot \frac{1}{0.75} \cdot \frac{1}{1} = 0.4

Comparing the utilities, the production policy \pi (utility = 1) looks better than the new policy \pi_{\text{new}} (utility = 0.4).

Case #2:

\pi recommends a_A, the user clicks it, and \pi_{\text{new}} also has the highest propensity for a_A at 0.9.

The utility score of \pi will be 1. The utility score of \pi_{\text{new}} will be:

0.90 \cdot \frac{1}{0.75} \cdot \frac{1}{1} = 1.2

Now the new policy \pi_{\text{new}} (utility = 1.2) looks better than the production policy \pi (utility = 1).

Case #3:

\pi recommends a_A, but the user doesn’t like it and clicks a_B instead.

The utility score of \pi will be 0 (0 correct / 1 total). The utility score of \pi_{\text{new}} will be:

\underbrace{0.60}_{\pi_{\text{new}}(a_B)} \cdot \frac{1}{\underbrace{0.15}_{\pi(a_B)}} \cdot \frac{1}{n} = 0.60 \cdot \frac{1}{0.15} \cdot \frac{1}{1} = 4

Comparing the utilities, the new policy \pi_{\text{new}} (utility = 4) looks better than the production policy \pi (utility = 0).
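The arithmetic in the three cases can be reproduced with a small helper. This is a minimal sketch: the function name `ips_utility` and the tuple layout are assumptions made here for illustration, and, as in Case #3 of this post, the click on a_B is treated as the observed interaction.

```python
def ips_utility(interactions):
    """Inverse-propensity-weighted utility of the new policy.

    `interactions` is a list of (new_propensity, logging_propensity, reward)
    tuples, one per logged interaction (a hypothetical layout).
    """
    n = len(interactions)
    return sum(p_new / p_log * reward for p_new, p_log, reward in interactions) / n


# Propensities taken from the three cases; each case logs one click (reward 1).
case_1 = ips_utility([(0.30, 0.75, 1)])  # pi_new(a_A) = 0.30, pi(a_A) = 0.75
case_2 = ips_utility([(0.90, 0.75, 1)])  # pi_new(a_A) = 0.90, pi(a_A) = 0.75
case_3 = ips_utility([(0.60, 0.15, 1)])  # pi_new(a_B) = 0.60, pi(a_B) = 0.15

print(round(case_1, 2), round(case_2, 2), round(case_3, 2))
```

Up to floating-point rounding, the three values match the utilities of 0.4, 1.2, and 4 computed by hand. Because each case uses a single interaction, these estimates are very noisy; in practice the sum runs over many logged interactions.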


In summary, the weighting mechanism of IPW helps us understand how a new policy would have performed by explicitly handling the biases that exist in the offline evaluation data produced by the existing policy (e.g. the rich-get-richer effect).

By weighting each observed interaction with the inverse propensity of the production policy, we place more importance on interactions that the production policy was unlikely to produce and less importance on interactions that it was likely to produce.
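The effect of this reweighting can be checked with a small simulation. The sketch below assumes a three-action setting: the propensities 0.75/0.15 for the production policy and 0.30/0.60 for the new policy echo the examples in this post, while the third action, the click rates, and every name in the code are hypothetical.

```python
import random

actions = ["a_A", "a_B", "a_C"]
pi_log = {"a_A": 0.75, "a_B": 0.15, "a_C": 0.10}     # production (logging) policy
pi_new = {"a_A": 0.30, "a_B": 0.60, "a_C": 0.10}     # new (target) policy
click_rate = {"a_A": 0.2, "a_B": 0.6, "a_C": 0.4}    # hypothetical true click rates


def simulate_logs(n, seed=0):
    """Draw n recommendations from the production policy and simulate clicks."""
    rng = random.Random(seed)
    shown = rng.choices(actions, weights=[pi_log[a] for a in actions], k=n)
    return [(a, 1 if rng.random() < click_rate[a] else 0) for a in shown]


def ips_estimate(logs, target, logging):
    """Average of the logged rewards, each weighted by target / logging propensity."""
    return sum(target[a] / logging[a] * r for a, r in logs) / len(logs)


logs = simulate_logs(200_000)
logged_average = sum(r for _, r in logs) / len(logs)  # reflects the production policy
ips_value = ips_estimate(logs, pi_new, pi_log)        # estimate for the new policy
true_new_value = sum(pi_new[a] * click_rate[a] for a in actions)

print("plain logged average:", round(logged_average, 3))
print("IPW estimate for pi_new:", round(ips_value, 3))
print("true value of pi_new:", round(true_new_value, 3))
```

With enough logged data, the IPW estimate should land close to the true value of the new policy, whereas the plain logged average only describes how the production policy performed.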

As illustrated in the examples above, this helps us to understand if the new policy performs better than the existing policy.

This summary captures the main idea behind using Causal Inference techniques in the context of recommendation engines. I have left out several interesting gems presented in the original paper (e.g. counterfactual risk minimization, fairness of recommendation engines, etc.), so have a look at the paper if you are interested in learning more :).

