Applied Data Science & ML

Changes to my workflow

Sat, 11 Apr 2026 15:00:00 GMT

It’s been a while. I’ve been training for a triathlon and between that, life changes, and the day job, writing blog posts fell off the priority list. But something has shifted enough in how I work that I wanted to document it before it becomes normal and I forget what it replaced.

As I have mentioned in my previous posts, I build data pipelines, ML and causal inference models, and internal analytics tools. Currently, I’m working on a research paper and pushing toward a conference deadline. For the past several months, I’ve been using an internal coding assistant that lives in my terminal: it reads my files, runs shell commands, talks to cloud services, and maintains context across long sessions. Not autocomplete. Closer to a junior engineer who has read every file in your repo and takes direction well (conditional on the task), but will confidently ship untested code if you let it.
The interesting thing isn’t the code generation. It’s that my job has quietly shifted from writing code to delegating work and evaluating output. That’s a different skill entirely, and most people using these tools haven’t realized the transition is happening to them.

API That Went Through Five Revisions

I was building an analysis engine for a serverless function that loads a hierarchical metric tree from a data warehouse (think: a top-level KPI decomposes into sub-metrics algebraically, each decomposing further across ~100 nodes), and propagates what-if changes through the tree. The agent wrote the first version in hours. It worked. Tests passed.

What followed was five rounds of refactoring the same module. Each iteration, the agent rewrote the code, updated tests, redeployed. I kept pushing: “no, these columns aren’t tree nodes, they’re raw inputs to derived ratios. The naming should reflect that.” A flat lookup dict became a typed configuration with a dataclass. Each round got cleaner.

I was doing design work, not typing work. The coding agent handled mechanical refactoring: renaming across files, updating assertions, regenerating query logic. I decided whether the abstraction made sense.

The code review came back with a pointed question: “Do we still need this data-loading mode?” I initially argued yes! The reviewer was right though: another tool already handled data inspection. And because each service should do one thing, we ripped the duplicated data-loading module out, rewrote the interface from four modes to two, updated three packages, revalidated all nodes, and merged. The agent did all the mechanical work, and I was responsible for detecting “smelly” work and pushing back when necessary.

Writing Math Proofs

This is where it got interesting. My paper extends a recent sensitivity analysis framework from linear models to double machine learning. The theory section was a 23-line stub. My collaborator said it needed to become the centerpiece. I had the agent download the original paper from arXiv, extract the proofs, and explain each theorem. The key finding was that the original proofs use linear algebra projections everywhere, and those don’t extend to nonparametric models. We couldn’t just cite them. We needed original proofs.

After a comprehensive literature review, a three-tier proof structure emerged in one session. Proposition 1: the linear case reduces exactly to the original framework. Proposition 2: additive models, consistency via decomposition. Proposition 3: general nonlinear case via concentration inequalities. The agent wrote the LaTeX. I checked the math. Then I told it to run an adversarial audit on its own proofs.

It found two critical errors. First, it used an inequality that requires independent random variables. Our sampling scheme produces negatively correlated variables. Wrong inequality. We switched to one that handles dependent sampling. Second, it claimed a variance function is always concave. I was skeptical. It produced a counterexample with two interacting binary variables where the marginal information gain *increases*, which is the opposite of what concavity requires. The claim was false.

The agent wrote a proof, disproved part of its own proof with a concrete counterexample, then rewrote the proposition with an honest three-case treatment. I’ve worked with human collaborators who wouldn’t catch that. At this point it feels like if I throw as many tokens at a problem as time allows and give the agent clean, organized context, solving a great number of problems is just a matter of time (I’m not going to argue about whether it can find the Pareto-optimal solution).

Thousands of Cloud Training Jobs

Next, for the above mentioned paper we needed a massive simulation study with nearly 3,000 configurations across effect sizes, confounding strengths, and increasing sample sizes. Each runs as a cloud training job. The agent wrote the dispatch script.

The dispatch job died at job 194. Root cause: an O(n) bottleneck on every dispatch. For every new job, the script polled the status of all previously launched jobs, so the total work grew quadratically with the number of jobs. API rate limits turned this into a 40-second check per iteration. Throughput collapsed. Credentials expired.

To fix this, the agent researched the cloud API’s behavior, found that job creation fails synchronously at the quota limit rather than queuing, and redesigned the script as fire-and-forget with retry-on-quota. The new version dispatched at maximum API rate, handled quota limits with backoff, and added resume flags for credential expiry. After this fix, all jobs ran.
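
The fire-and-forget pattern can be sketched roughly like this. To be clear, this is not the script the agent wrote: `submit` and `QuotaExceeded` are hypothetical stand-ins for whatever the cloud SDK actually exposes, and the backoff numbers are arbitrary. The point is only that each dispatch does constant work, with no polling of earlier jobs.

```python
import time
from typing import Callable, Iterable, Set


class QuotaExceeded(Exception):
    """Hypothetical error: job creation was rejected because the quota is full."""


def dispatch_all(
    job_ids: Iterable[str],
    submit: Callable[[str], None],
    already_done: Set[str],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Submit every job once, backing off only when the quota is hit."""
    done = set(already_done)
    for job_id in job_ids:
        if job_id in done:
            continue  # resume: skip jobs a previous run already launched
        delay = base_delay
        while True:
            try:
                submit(job_id)  # fire and forget: no status polling here
                done.add(job_id)
                break
            except QuotaExceeded:
                sleep(delay)  # wait for running jobs to free up capacity
                delay = min(delay * 2, max_delay)
    return done
```

Persisting the returned set (for example to a file) is what would make a restart after credential expiry cheap: the next run passes it back in as `already_done`.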

Then it collected results, identified which hyperparameter actually dominates model bias (not the one I expected; a 3x range from a single parameter), and generated the appendix analysis. Dispatch, collection, analysis in ONE day. That’s a week or more of deep focus work without the coding agent.

Where It Goes Wrong

One time the agent submitted a code review for three packages without deploying or running end-to-end validation. I managed to catch it and told the agent: “We must test the entire pipeline before sending the review. Actually make sure this never happens again.” To handle this, we added a pre-submission validation checklist as a permanent rule. The lesson is that agents in their current state always try to optimize for task completion and will skip validation unless we explicitly require it. One solution that seems to work well is to set up “hooks” that run automatically after builds. This way you can make the validation deterministic and don’t rely on the agent to remember the routine, which becomes more of an issue as context sizes grow.

Another pattern that I need to mention was when the paper draft had stale numbers from an old experiment. The agent’s adversarial audit caught it, which led to a key statistic changing meaningfully when we switched to proper cross-validation. The old numbers were wrong and had been sitting in the draft for days. The agent can catch errors, but only when you tell it to look. It doesn’t spontaneously doubt its own previous output, and in most cases, it will say that everything has been implemented and tested thoroughly. To create the adversarial audit, I worked with the agent to read and compile the literature on which steering instructions and methods tend to work well with coding agents. As a result, we created an instruction document that I invoke manually whenever something smells fishy. Works pretty well so far.

Delegation as the Core Skill

Here’s what I think most people miss. Using these tools well is not about prompt engineering or knowing the right magic words. It’s about delegation and evaluation, which are the same skills you need to manage people, applied to a machine.

I write instruction sets loaded by task phase: thinking, researching, implementing, reviewing. I run adversarial audits where the agent attacks its own work. I enforce checklists. I push back when the abstraction is wrong. It’s less like pair programming and more like being a tech lead for a very fast, very literal engineer who occasionally produces brilliant work and occasionally tries to ship without testing.

The evaluation part is critical and underappreciated. When the agent writes a proof, I need to know enough math to verify it. When it refactors a data mapping, I need to understand the schema well enough to judge whether the new abstraction is better. When it dispatches cloud jobs, I need to understand the API limits to know if the retry logic makes sense. I think I now understand what people meant when they said “AI amplifies your existing expertise”. If you don’t have the expertise, it amplifies your ability to produce confident-looking garbage.

I wrote proofs, shipped a production API, dispatched thousands of training jobs, and debugged infrastructure WITHIN the same week and while training for a triathlon. A few months ago, that would have been a month of just the engineering work. The tools are crude, context windows degrade, and the agent drifts without anchoring. But for the kind of work where the hard part is the science and the design, and not the typing, it’s already a different game for me. And to think I’ve only started investing into learning agentic coding this January…..

Cheers, see yall next time!

Going down a random rabbit hole: From XML Tags to $100M Weight Updates

Tue, 03 Mar 2026 15:00:00 GMT

I started with a small question: why does Anthropic insist on XML tags? It sounded like a formatting preference. The deeper I dug into the engineering, the more it looked like XML is less a style choice and more a statistical anchor the model’s parameters were trained to recognize. Here is the thread I pulled on.

1. The XML “attention fence”

The reason XML reads like a native interface to Claude comes down to how its self-attention heads were trained.

During pre-training, untrusted data gets wrapped in tags. That teaches the model to lower the weight it places on tokens inside a tagged block when it is predicting the next token for a system instruction. The tags end up acting as a structural fence, which is what keeps the model from mistaking your background data for a new command (sometimes called attention contamination).

2. Phase 1: supervised fine-tuning as behavioral cloning

This is the stage where the model learns the constitution, a rulebook of principles. It isn’t reasoning yet, it is mimicking.

The loop: the model generates a response, critiques it against the constitution, revises it, and then gets trained to clone that revised version. The mechanism is cross-entropy loss. For every token, the model outputs a probability vector $\hat{y}$ over its full vocabulary (on the order of 100k tokens). You compare against a one-hot ground-truth vector $y$, which is all zeros except a $1$ at the correct token’s index:

$$L = -\sum_{i} y_i \log \hat{y}_i$$

If the model put 40% on “Hello” when the truth put 100% there, backpropagation adjusts weights across the network to make “Hello” more likely next time.
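
To make the 40% example concrete, here is a tiny worked version. The three-token vocabulary and the probabilities other than 0.4 are made up for illustration; the gradient line uses the standard softmax-plus-cross-entropy result, which is my addition rather than anything specific to one model.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy against a one-hot target: -log of the probability on the correct token."""
    return -math.log(probs[target_index])


# Toy 3-token vocabulary; index 0 plays the role of "Hello".
probs = [0.4, 0.35, 0.25]
loss = cross_entropy(probs, target_index=0)  # -ln(0.4), roughly 0.92

# With a softmax output layer, the gradient of the loss with respect to the
# logits is (probs - one_hot): negative on the correct token, positive elsewhere.
one_hot = [1.0, 0.0, 0.0]
grad_logits = [p - y for p, y in zip(probs, one_hot)]
```

A gradient-descent step moves against `grad_logits`, so the logit for the correct token goes up and the others go down.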

3. Phase 2: RLAIF and PPO, the optimization stage

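Once the model knows how to speak, this phase teaches it judgment, using reinforcement learning from AI feedback. In its common textbook form (a clipped surrogate plus a KL penalty toward the Phase 1 model), the PPO objective looks like:

$$J(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_t\right)\right] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{SFT}}\right)$$

Here $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the new and old policy, $\hat{A}_t$ is the advantage computed from the judge’s reward, and $\pi_{\mathrm{SFT}}$ is the Phase 1 model.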

Three pieces worth pulling apart:

The reward. An AI judge model scores responses. Follow the constitution and use structure correctly, and the response earns a positive scalar; the logic that produced it gets reinforced.
The KL penalty. This is the part that matters. If the model tries to game the reward by drifting into gibberish the judge happens to like, the KL divergence term spikes and pulls it back toward the stable language model from Phase 1.
PPO clipping. Proximal Policy Optimization clips the probability ratio between the new and old policy, which caps how far the policy can move per update and prevents model collapse from a few outlier rewards.
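
A per-sample sketch of the last two pieces, under my own assumptions: the function names, `eps=0.2`, and `beta=0.1` are illustrative defaults, and the log-probability difference is one common per-sample stand-in for the KL term, not a claim about any lab’s actual training code.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalized_reward(judge_reward, logp_policy, logp_reference, beta=0.1):
    """Judge reward minus a per-sample KL estimate against the reference model."""
    return judge_reward - beta * (logp_policy - logp_reference)


# A good action whose probability already rose 50%: the gain is capped at 1.2.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)

# A bad action whose probability rose 50%: the full penalty is kept (no clipping).
full_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

The asymmetry is the design choice: taking the minimum means clipping only ever removes incentive to move further, it never hides a penalty.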

4. Scaling: the infrastructure bill

Classic RLHF is bottlenecked by human reading speed. RLAIF removes the human from the inner loop, so the feedback runs at GPU-cluster speed instead. The cost isn’t labor at that point, it is the electricity and compute to run these self-improvement loops at scale.

The takeaway. Using XML tags isn’t about being tidy. It is about lining your prompt up with the statistical patterns the model’s weights were optimized to reward in the first place.

Estimating the Distribution of Omitted Variable Bias in Causal Inference

Sat, 05 Apr 2025 15:00:00 GMT

I went down a rabbit hole on omitted variable bias, specifically the question of not just whether a confounder biases your estimate, but how to put a range on how badly. These are my notes on the main families of methods people use, what each one buys you, and where each one breaks.

What OVB actually is

Omitted variable bias shows up when you leave out a variable that is correlated with both your treatment of interest and your outcome. The model has nowhere to put that variable’s influence, so it smears it onto the coefficients you did include, and your estimate of the causal effect drifts up or down.

The classic example: regressing salary on years of education while leaving out ability. If ability raises both how much schooling someone gets and how much they earn, the education coefficient absorbs part of ability’s effect, and you overstate the return to education.

Two conditions have to hold for the bias to exist:

the omitted variable genuinely affects the outcome, and
it is correlated with an included regressor.

If either fails, omission is harmless. The direction of the bias follows the signs: same-sign correlations push the estimate up, opposite signs push it down. Reasoning through those signs without knowing the magnitude is what people call “signing the bias.”

The algebra

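Suppose the true model is

$$y = \beta_0 + \beta_1 x + \beta_2 z + \varepsilon$$

but you omit $z$ and fit

$$y = \alpha_0 + \alpha_1 x + u$$

Then the estimated coefficient on $x$ has expectation

$$E[\hat{\alpha}_1] = \beta_1 + \beta_2 \delta_1$$

where $\delta_1$ is the coefficient from regressing the omitted $z$ on the included $x$. The bias term is $\beta_2 \delta_1$: it is zero exactly when $z$ has no effect on $y$ ($\beta_2 = 0$) or when $x$ and $z$ are uncorrelated ($\delta_1 = 0$). Clean, but not directly usable, because $\beta_2$ and $\delta_1$ both involve the thing you never observed. That is the whole problem, and every method below is a different way around it.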
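
A quick way to convince yourself of the bias formula is to simulate a case where you do observe the confounder and then pretend you don’t. Everything below (coefficients, sample size, the `ols` helper) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
z = rng.normal(size=n)                             # the confounder we will omit
x = 0.8 * z + rng.normal(size=n)                   # treatment, correlated with z
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)   # true effect of x is 2.0


def ols(columns, target):
    """OLS coefficients, with an intercept prepended as the first column."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


_, b1_long, b2_long = ols([x, z], y)   # long regression: includes z
_, b1_short = ols([x], y)              # short regression: omits z
_, delta = ols([x], z)                 # auxiliary regression of z on x

bias_formula = b2_long * delta         # coefficient on z times the slope of z on x
bias_observed = b1_short - b1_long     # what omitting z actually did
```

In-sample the two bias numbers agree up to floating-point error, which is the OLS version of the algebra. The catch is unchanged: with real data you cannot run the long or the auxiliary regression.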

Five ways to put a range on the bias

1. Sensitivity analysis. Instead of assuming no confounding, ask how strong a confounder would need to be to overturn your conclusion. The robustness value captures this: the minimum association (measured as partial $R^2$) an unobserved confounder must have with both treatment and outcome to drive the effect to zero. If a weak, implausible confounder is enough to flip the result, the finding is fragile; if only an implausibly strong one would do it, the finding is robust. Cinelli and Hazlett’s contour plots are the standard way to read this off visually, showing where the estimate stays significant across combinations of confounder strength.

2. Bounding approaches. Rather than scan scenarios, fix an assumption about the most explanatory power any omitted variable could plausibly have, and derive hard upper and lower limits on the effect. The bounds are only as credible as that ceiling assumption. Set it too low and the true effect can fall outside your interval; set it too high and the bounds are so wide they say nothing. Covariate benchmarking anchors the assumption in data: argue that no unobserved confounder is stronger than, say, your strongest observed covariate, and use that as the empirical ceiling.

3. Simulation. Build synthetic datasets where you control the omitted variable’s effect, then estimate the treatment effect while deliberately leaving it out, repeatedly. The spread of estimates is the empirical distribution of the bias. Useful for two things: seeing which conditions make OVB worst (stronger confounder-regressor correlation, larger effect on the outcome), and validating that a sensitivity method’s claimed bounds actually cover the truth.

4. Bayesian methods. Put a prior on the unobserved confounder’s parameters (or directly on the bias), and the posterior on the causal effect carries that uncertainty through. You get a full distribution over plausible effects instead of a point estimate. The catch is the obvious one: if the posterior moves a lot when you change the prior, the data isn’t doing the work, your assumptions are.

5. Machine learning. When relationships are non-linear or high-dimensional, flexible models estimate the nuisance functions better than a hand-specified linear model. DoubleML now ships OVB sensitivity analysis inside the double-machine-learning framework, and methods like BART give uncertainty estimates directly. The open problem is the usual one with flexible models: quantifying causal uncertainty rigorously, and explaining what the model is actually conditioning on.
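
For the robustness value in item 1, Cinelli and Hazlett give a closed form that only needs the treatment’s t-statistic and the residual degrees of freedom. A minimal sketch, assuming I have transcribed their formula correctly; the function name and the example numbers are mine, and this is the point-estimate version (how much confounding drives the estimate to zero when `q=1`), not the one that targets statistical significance.

```python
import math


def robustness_value(t_stat, dof, q=1.0):
    """Robustness value RV_q computed from a t-statistic and residual degrees of freedom.

    q is the share of the estimate the confounder must explain away
    (q=1 means driving the point estimate all the way to zero).
    """
    f_q = q * abs(t_stat) / math.sqrt(dof)   # scaled partial Cohen's f
    return 0.5 * (math.sqrt(f_q**4 + 4 * f_q**2) - f_q**2)


# Example numbers chosen for illustration.
rv = robustness_value(t_stat=4.18, dof=783)
```

Read the result as a partial $R^2$: a confounder would need to explain at least that share of the residual variance of both treatment and outcome to wipe out the estimate. In practice I would use an existing package rather than hand-rolling this.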

How they compare

Method	Strength	Where it breaks
Analytical	Gives the fundamental picture of how OVB arises.	Relies on quantities you can’t observe.
Sensitivity analysis	Quantifies how much confounding it takes to flip the result.	Gives no single “corrected” estimate.
Bounding	Returns an actual range for the effect.	Only as good as the plausibility ceiling you assume.
Simulation	Controlled, lets you validate other methods.	Conclusions are hostage to the data-generating process you chose.
Bayesian	Carries uncertainty through to a full posterior.	Sensitive to prior specification.
Machine learning	Handles non-linear, high-dimensional confounding.	Uncertainty quantification for causal effects is still maturing.

What I took away

The honest core of all of it: you cannot measure what you did not observe, so every method here trades the impossible question (“what is the bias?”) for a tractable one (“how strong would a confounder have to be, and is that plausible here?”). Sensitivity analysis and covariate benchmarking are the two I find most useful in practice, because they hand the judgment back to domain knowledge instead of hiding it inside a prior or a simulated data-generating process.

References

The starting points worth reading if you want the real treatment, not my notes:

Cinelli & Hazlett (2020), Making Sense of Sensitivity: Extending Omitted Variable Bias. The robustness value and contour plots.
Chernozhukov et al. (2021), Long Story Short: Omitted Variable Bias in Causal Machine Learning. Extends OVB to ML. arXiv
DoubleML, sensitivity analysis documentation.
Omitted-variable bias on Wikipedia for the algebra.
Eggers, Intermediate Causal Inference lecture slides.


When Linear Regression Gets Massively Confused

Fri, 14 Mar 2025 15:00:00 GMT

This one is about a failure mode I keep running into: a mass point, a large cluster of identical values in your data, quietly breaking linear regression. It often shows up right after a log transform, and the symptoms are easy to misread.

What a mass point is

Say a large number of your observations share the same value, for example a crowd of customers who all bought a $1 product. You log-transform sales to make the distribution more normal, and every one of those $1 values collapses to zero, since log(1) = 0. Now the transformed data is dominated by a spike of zeros.

Visually: the original data has a tall spike at x = 1, and after the log transform that spike moves to log(x) = 0 and gets taller relative to everything else.

Why linear models struggle with it

Linear regression leans on a few assumptions:

Linearity: the relationship between predictors and outcome is linear.
Independence of errors: residuals are independent of each other.
Homoscedasticity: residuals have constant variance across all levels of the predictors.
Normality: residuals are roughly normally distributed.

A mass point mainly attacks the last two. With a big concentration of identical values (here, the spike at 0), two things go wrong:

Non-normal residuals. The mass point pulls residuals toward itself, introducing asymmetry and distorting their spread, so they no longer sit balanced and normal around zero.
Heteroscedasticity. Residual variance stops being constant, because the model can’t fit the dense region and the sparse region equally well. That matters because non-constant variance gives you biased standard errors (so t-tests and confidence intervals become unreliable) and inefficient estimates (predictions are less precise than they could be).

The net effect: the model tries to fit a straight line through data that violates its own assumptions, and you get biased or inefficient estimates. On a scatter plot you can see the fitted line being dragged awkwardly toward the cluster at zero.

Models that handle it better

Instead of forcing a single linear fit, use a model that explicitly accounts for the mass point. Three options, with tradeoffs:

Model	Key idea	Pros	Cons	When to use
Hurdle	A classifier first decides whether an observation is in the mass-point group; a separate regression models the continuous values for the rest.	Cleanly separates the spike from the continuous part.	Two models to fit and coordinate.	A mix of a mass point (e.g. lots of identical values such as zeros) and continuous data.
Mixture	Model the data as a blend of two distributions: one for the spike, one for the smooth continuous part.	Flexible, captures distinct subpopulations in one model.	Harder to fit; estimating the mixture components can be unstable.	Data that naturally splits into a sharp spike plus a smooth distribution.
Zero-inflated	A logistic component models excess zeros; a continuous model handles the rest.	Purpose-built for far more zeros than a standard model expects.	Best only when the mass point really is at zero; more setup.	An unusually high count of zeros (or one specific mass point).
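
The hurdle row can be sketched with NumPy alone. Everything here is an assumption for illustration: the synthetic data, the coefficients, and the hand-rolled Newton-step logistic fit (a real project would use a statistics library for both parts).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4_000
x = rng.uniform(0.0, 3.0, size=n)

# Synthetic data: low-x customers are more likely to sit in the mass point,
# where log-sales is exactly 0; everyone else follows a noisy linear trend.
p_mass = 1.0 / (1.0 + np.exp(-(1.0 - 1.5 * x)))
in_mass = rng.random(n) < p_mass
log_sales = np.where(in_mass, 0.0, 1.0 + 0.5 * x + rng.normal(0.0, 0.3, size=n))

X = np.column_stack([np.ones(n), x])  # intercept plus one predictor


def fit_logistic(X, y, steps=25):
    """Logistic regression via Newton's method (fine for small, well-behaved data)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        w = w + np.linalg.solve(hessian, gradient)
    return w


# Part 1: classifier for "is this observation in the mass point?"
w_mass = fit_logistic(X, in_mass.astype(float))

# Part 2: ordinary least squares on the continuous observations only.
w_cont, *_ = np.linalg.lstsq(X[~in_mass], log_sales[~in_mass], rcond=None)

# For comparison: a single linear fit through everything, spike included.
w_single, *_ = np.linalg.lstsq(X, log_sales, rcond=None)


def predict_hurdle(X_new):
    """Expected log-sales: P(not in mass point) times the continuous-part prediction."""
    p_in_mass = 1.0 / (1.0 + np.exp(-X_new @ w_mass))
    return (1.0 - p_in_mass) * (X_new @ w_cont)
```

On this synthetic data the continuous-part slope `w_cont[1]` lands near the 0.5 used to generate it, while the single fit’s slope `w_single[1]` is inflated because the spike of zeros sits mostly at low `x`.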

Takeaway

A giant spike at zero in log-transformed data is a signal, not noise to ignore. Linear regression will fit it, but the standard errors and efficiency you rely on quietly stop being trustworthy. A hurdle, mixture, or zero-inflated model fits the structure of the data instead of fighting it.

Appendix: why heteroscedasticity matters

Effect on the estimates

Biased standard errors. Under heteroscedasticity the standard errors of the coefficients are unreliable, which throws off confidence intervals and hypothesis tests and raises the chance of Type I or Type II errors.
Inefficient estimates. OLS stays unbiased, but it is no longer the best linear unbiased estimator (BLUE). The estimates have higher variance than necessary, so they are less precise.

Why efficiency is worth caring about

Efficiency means getting estimators with the smallest possible variance. It buys you two things:

Precision. Tighter confidence intervals, so the estimates are more reliable.
Statistical power. More precision means tests are likelier to detect real effects, lowering the risk of Type II errors.

Causal Inference: Assessing Overlap in Covariate Distributions

Sat, 31 Aug 2024 15:00:00 GMT

This one summarizes what I’ve captured from Chapter 14 of the book “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction” by Guido W. Imbens and Donald B. Rubin.

A few shortcuts I’ve used:

T - shortcut for treatment group
C - shortcut for control group

Covered sections and their summary:

14.1 Introduces why adjusting for differences in the covariate distributions is needed
14.2 Goes over a case with two univariate distributions
14.3 Goes over a case with two multivariate distributions
14.4 Explains the role of propensity scores in assessing the overlap of covariate distributions
14.5 Provides a measure to assess if each treatment/control unit has an identical non-treated/treated twin
14.6 Provides examples (skipped)

1. Why do we need to adjust for differences in the covariate distributions?

If there is a region in the covariate space where T or C has relatively few units/samples, our inferences in that region will depend largely on extrapolation and will therefore be less credible than inferences in regions where T and C overlap substantially in their covariate distributions.

Example:
Covariate space #1:

T has 5 males with age<=18
C has 55 males with age<=18.

Covariate space #2:

T has 55 males age>18 and age<=30
C has 60 males age>18 and age<=30

In the above example, our inferences for the covariate space #2 will be more credible than our inferences for the covariate space #1.

To note, even in cases when there is no confounding (unconfoundedness), this is still a fundamental issue. However, if we have a completely randomized experiment, we can expect the distribution of covariates to be similar by definition, and thus there is less risk of stumbling upon this issue.

2. Assessing overlap of univariate distributions

1. Comparing differences in location of two distributions

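Given two univariate probability distributions $f_c$ and $f_t$ with means $\mu_c$ and $\mu_t$, and variances $\sigma_c^2$ and $\sigma_t^2$, we can estimate the differences in locations of the two distributions with:

$$\Delta_{ct} = \frac{\mu_t - \mu_c}{\sqrt{(\sigma_t^2 + \sigma_c^2)/2}}$$

, which is a normalized measure of difference of two distributions. To note, this is different from the t-statistic that tests whether the data contain sufficient info to support the hypothesis that two covariate means in T and C are different (with sample means $\bar{X}_t, \bar{X}_c$, sample variances $s_t^2, s_c^2$, and group sizes $N_t, N_c$):

$$T_{ct} = \frac{\bar{X}_t - \bar{X}_c}{\sqrt{s_t^2/N_t + s_c^2/N_c}}$$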

For our purposes, we don’t care about testing if we have enough data to check if means are different between T&C. Instead, we care about understanding if the differences between the distributions are so large that we need to fix the issue, and at the same time, check what kind of adjustment methods are required for the problem at hand.

2. Comparing measures of dispersion of two distributions

The log difference in the standard deviations helps to understand the difference in dispersion:

$$\Gamma_{ct} = \ln\left(\frac{\sigma_t}{\sigma_c}\right) = \ln(\sigma_t) - \ln(\sigma_c)$$

, we take the log because it is more normally distributed compared to the simple difference or ratio of standard deviations.
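
The difference between the normalized difference and the t-statistic is easiest to see numerically. A small sketch with toy numbers of my own, using sample means and variances in place of the population quantities (function names are mine):

```python
import math
from statistics import mean, variance


def normalized_difference(x_t, x_c):
    """Difference in means scaled by the average within-group standard deviation."""
    return (mean(x_t) - mean(x_c)) / math.sqrt((variance(x_t) + variance(x_c)) / 2)


def t_statistic(x_t, x_c):
    """Two-sample t-statistic: the same numerator, scaled by the standard error."""
    se = math.sqrt(variance(x_t) / len(x_t) + variance(x_c) / len(x_c))
    return (mean(x_t) - mean(x_c)) / se


def log_sd_ratio(x_t, x_c):
    """Log ratio of standard deviations; 0 means equal dispersion."""
    return math.log(math.sqrt(variance(x_t))) - math.log(math.sqrt(variance(x_c)))


x_c = [1, 2, 3, 4, 5]
x_t = [2, 3, 4, 5, 6]

small = (normalized_difference(x_t, x_c), t_statistic(x_t, x_c))

# Ten copies of the same data: the imbalance is the same, only the sample is bigger.
large = (normalized_difference(x_t * 10, x_c * 10), t_statistic(x_t * 10, x_c * 10))
```

With ten times the data the t-statistic grows from 1.0 to 3.5 while the normalized difference barely moves (about 0.63 to 0.70, and only because of the n - 1 in the sample variance). That is why the normalized difference is the better gauge of how much adjustment the covariates need.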

3. Comparing the tail overlaps of two distributions

To understand the overlap of covariate distributions in the tail regions, we can calculate the fraction of treatment units that exist in the tails of the distribution of covariate values in the control.

One method is to choose an arbitrary tail boundary, e.g. $\alpha = 0.05$, and calculate the probability mass of T’s covariate distribution outside the $\alpha/2$ and $1 - \alpha/2$ quantiles of C’s covariate distribution:

$$\pi_t^{\alpha} = \left(1 - F_t\left(F_c^{-1}(1 - \alpha/2)\right)\right) + F_t\left(F_c^{-1}(\alpha/2)\right)$$

, where $F_t$ and $F_c$ represent the CDFs of the T and C covariates’ distributions (so $F_c^{-1}$ is the quantile function of the control distribution). We can do the same for C, where we calculate the probability mass of C’s covariate distribution outside the $\alpha/2$ and $1 - \alpha/2$ quantiles of T’s covariate distribution:

$$\pi_c^{\alpha} = \left(1 - F_c\left(F_t^{-1}(1 - \alpha/2)\right)\right) + F_c\left(F_t^{-1}(\alpha/2)\right)$$

The intuition here is that it is relatively easy to impute outcome values of either T or C units in the dense regions of the distribution (e.g. near the mean of a normal distribution where the majority of samples are concentrated), while it is harder in the tail regions where only a few units exist. One can refer to the following two distributions as an example of two differently dispersed covariate distributions:

[Figure: two differently dispersed covariate distributions]

Last but not least, here’s how to interpret the values:

$\pi_t^{\alpha} = \pi_c^{\alpha} = \alpha$ is what we expect in a completely randomized experiment, where only a fraction $\alpha$ of units (5% for $\alpha = 0.05$) have covariate values that make us cry (i.e. difficult to impute the missing potential outcomes of a group).
if $\pi_t^{\alpha} > \alpha$, it will be relatively difficult to impute/predict the missing control outcomes for those treatment units (more treatment units exist in the tail region of the control group’s covariate distribution).
if $\pi_c^{\alpha} > \alpha$, it will be relatively difficult to impute/predict the missing treatment outcomes for those control units (more control units exist in the tail region of the treatment group’s covariate distribution).
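
An empirical version of the tail measure replaces the CDFs with sample quantiles. This is a sketch under that assumption; the function name and the toy data are mine.

```python
import numpy as np


def tail_fraction(x_own, x_other, alpha=0.05):
    """Share of x_own lying outside the alpha/2 and 1 - alpha/2 quantiles of x_other."""
    lower, upper = np.quantile(x_other, [alpha / 2, 1 - alpha / 2])
    x_own = np.asarray(x_own)
    return float(np.mean((x_own < lower) | (x_own > upper)))


x_c = np.arange(1000)   # control covariate values 0..999
x_t = x_c + 500         # treatment group shifted to the right

pi_t = tail_fraction(x_t, x_c)          # T units in the tails of C
pi_c = tail_fraction(x_c, x_t)          # C units in the tails of T
pi_baseline = tail_fraction(x_c, x_c)   # a group against itself gives alpha
```

Here more than half of each group sits in the other group’s tails, far above the 5% you would see with good overlap.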

3. Assessing overlap of multivariate distributions

Given K covariates, we could compare the distributions of T & C iteratively one-by-one using the above approach. A good way is to start with the covariates that you believe a priori are highly associated with the outcome (i.e. ensuring the important covariates are okay).

Using Mahalanobis distance to compare differences of multivariate distributions

In addition to the above metrics that can be used for univariate distributions, there is a neat multivariate summary measure that captures the difference in locations between two distributions, similar to the one mentioned in the first part of section 2. It leverages the Mahalanobis distance and takes as input the two K-dimensional vectors of covariate means $\mu_c$ and $\mu_t$:

$$\Delta_{ct}^{mv} = \sqrt{(\mu_t - \mu_c)^\top \left(\frac{\Sigma_c + \Sigma_t}{2}\right)^{-1} (\mu_t - \mu_c)}$$

, where

$$\Sigma_c = \mathbb{E}\left[(X_i - \mu_c)(X_i - \mu_c)^\top \mid i \in C\right], \quad \Sigma_t = \mathbb{E}\left[(X_i - \mu_t)(X_i - \mu_t)^\top \mid i \in T\right]$$

are the covariance matrices of the covariates for C & T, respectively. Intuitively, a larger Mahalanobis distance tells us that the distributions are further apart in multivariate space, considering both location and spread.
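
A sample version of this measure, assuming at least two covariates and using sample means and covariances in place of the population ones (the function name and toy data are mine):

```python
import numpy as np


def mahalanobis_balance(X_t, X_c):
    """Mahalanobis distance between group means, using the averaged covariance matrix."""
    X_t = np.asarray(X_t, dtype=float)
    X_c = np.asarray(X_c, dtype=float)
    diff = X_t.mean(axis=0) - X_c.mean(axis=0)
    pooled = (np.cov(X_t, rowvar=False) + np.cov(X_c, rowvar=False)) / 2
    # Solve pooled @ v = diff instead of inverting the matrix explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))


# Rows are units, columns are the K = 2 covariates.
X_c = np.array([[0, 0], [2, 0], [0, 2], [2, 2]])
X_t = X_c + np.array([1, 0])   # same spread, first covariate shifted by 1

distance = mahalanobis_balance(X_t, X_c)
```

With uncorrelated covariates this reduces to the univariate normalized difference on the shifted covariate, which is a handy sanity check.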

Using propensity scores to compare distributions

The reason we can use propensity scores to assess the balance of covariate distributions is that any difference in covariate distributions also shows up in the difference of propensity score distributions. In principle, a difference in covariate distributions leads to a difference in the expected propensity scores of T&C, i.e. the average propensity scores. So if there’s a non-zero difference in the propensity score distributions for T&C, it also implies there is a difference in the covariate distributions for T&C.

Since it is much less work to analyze the differences of univariate distributions compared to multivariate ones, leveraging propensity scores, whose distribution is univariate, makes the analysis a bit simpler.

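How does it work? Consider t(x) to be the true propensity score (assuming we know the treatment assignment mechanism) and l(x) to be the linearized propensity score (log odds ratio) of being in T vs C given covariates x:

$$l(x) = \ln\left(\frac{t(x)}{1 - t(x)}\right)$$

Then, we can simply look at the normalized difference in means for the propensity scores of each treatment, where:

$$\bar{l}_c = \frac{1}{N_c}\sum_{i \in C} l(X_i), \quad \bar{l}_t = \frac{1}{N_t}\sum_{i \in T} l(X_i)$$

are the average values for propensity scores of C & T units ($N_c$ and $N_t$ are the group sizes), and:

$$s_{l,c}^2 = \frac{1}{N_c - 1}\sum_{i \in C}\left(l(X_i) - \bar{l}_c\right)^2, \quad s_{l,t}^2 = \frac{1}{N_t - 1}\sum_{i \in T}\left(l(X_i) - \bar{l}_t\right)^2$$

are the variances of the propensity score, which leads us to the estimated difference in average propensity scores scaled by the square root of the average squared within-treatment-group standard deviations:

$$\hat{\Delta}_{ct}^{l} = \frac{\bar{l}_t - \bar{l}_c}{\sqrt{(s_{l,c}^2 + s_{l,t}^2)/2}}$$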

To note, if we are using linearized propensity scores, the propensity score function is scale-invariant, so there is not actually a need to normalize the above by the standard deviations.

Moreover, compared to how we assessed the univariate covariate distributions, there are some points to be aware of:

Differences in the covariate distributions are implied by the variation in the propensity scores.
If the treatment assignment mechanism is biased in some way, then it is possible that the covariate distributions are similar, but we observe a difference in the propensity score distributions.
On the other hand, the treatment assignment mechanism might not be biased, and the covariate distributions do indeed differ (implied by the difference in propensity score distributions).
If the covariate distributions of two treatments differ, then it must be that the expected value of the propensity score in T is larger than the expected value of the propensity score in the C (or vice-versa).

As a result, we can understand that differences in covariate distributions in T vs C imply, and are implied by, the differences in the average value of the propensity scores of T & C. Proof is captured in the book Chapter 14.4.

4. Estimating if T/C units have comparable twins in C/T

Consider a unit with treatment . For this unit we can determine if there is a comparable twin with treatment such that the difference in propensity scores is less than or equal to , where is an upper threshold implying the difference in propensity scores is less than 10%.

Intuitively, if there is a similar unit with opposite treatment and a similar propensity score, we may be able to obtain trustworthy estimates of causal effects without any extrapolation or additional work. However, if there are many such units without twins, it will be difficult to obtain credible estimates of causal effects, no matter what methods we use. A common method to solve this, given we have enough samples, is to trim the units without twins.

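We can estimate the degree of units with close comparison units by defining an indicator variable $\varsigma_i$ that flags whether unit $i$ has a comparable twin:

$$\varsigma_i = \begin{cases} 1 & \text{if } \sum_{j: W_j \neq W_i} \mathbf{1}\{|\hat{\ell}(X_j) - \hat{\ell}(X_i)| \le l_u\} \ge 1 \\ 0 & \text{otherwise} \end{cases}$$

where $W_i$ is the treatment of unit $i$, $\hat{\ell}(X_i)$ is its estimated linearized propensity score, and $l_u$ is the threshold above. Then for each of T & C we can define two overlap measures implying the proportion of units with close comparisons:

$$q_C = \frac{1}{N_C} \sum_{i: W_i = 0} \varsigma_i, \qquad q_T = \frac{1}{N_T} \sum_{i: W_i = 1} \varsigma_i$$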
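The twin check can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code: I am assuming the linearized propensity score is the log-odds of the estimated propensity score, a threshold of 0.1, and made-up scores and treatment flags. The function names are hypothetical.

```python
import math

def linearized(ps):
    """Log-odds of a propensity score (assumed definition of the linearized score)."""
    return math.log(ps / (1.0 - ps))

def overlap_measures(propensity, treated, threshold=0.1):
    """Return (q_c, q_t): the share of control / treated units that have at
    least one unit of the opposite treatment within `threshold` in
    linearized propensity score."""
    lps = [linearized(p) for p in propensity]
    has_twin = []
    for l_i, w_i in zip(lps, treated):
        twin = any(
            w_j != w_i and abs(l_j - l_i) <= threshold
            for l_j, w_j in zip(lps, treated)
        )
        has_twin.append(twin)
    controls = [t for t, w in zip(has_twin, treated) if w == 0]
    treateds = [t for t, w in zip(has_twin, treated) if w == 1]
    q_c = sum(controls) / len(controls)
    q_t = sum(treateds) / len(treateds)
    return q_c, q_t

# Toy, made-up propensity scores and treatment flags (1 = T, 0 = C).
propensity = [0.50, 0.51, 0.20, 0.90, 0.21]
treated = [1, 0, 1, 0, 0]
q_c, q_t = overlap_measures(propensity, treated)
print(f"q_c={q_c:.2f}, q_t={q_t:.2f}")
```

In this toy data the control unit with a propensity score of 0.90 has no treated twin, so it is the kind of unit that trimming would drop.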


Recommendations as treatments

Sun, 16 Jun 2024 15:00:00 GMT

This is a summary of “Recommendations as Treatments” (paper) by Joachims et al.

Image created by Dall-e: “Humorous illustration of Inverse Propensity Weighting. It features a quirky professor explaining the concept to amused students in a classroom, with a seesaw representing the ‘Treated’ and ‘Untreated’ groups.”

The main proposal of the paper is to view recommendation systems as policies that decide which interventions to make in order to optimize a desired outcome. This opens the door to applying Causal Inference techniques, such as Inverse Propensity Weighting, in the context of recommendation systems.

Offline A/B testing or Off-policy evaluation of recommender systems

The idea is to use historical data to evaluate recommender systems and obtain unbiased estimates without resorting to online A/B tests, which can be slow, expensive, and risky in terms of hurting the customer experience (CX).

If we can evaluate different recommendation policies offline (i.e. which policy would have performed best, if we had used it instead of the policy that logged the data), we will be able to find optimal policies, i.e. innovate faster.

Problem with offline A/B tests

Suppose we have two recommendation policies/models: a) the Logging policy (the current production model) and b) the Target policy (the new policy we would like to test). Since the historical evaluation data was produced by the Logging policy, its distribution follows whatever the Logging policy has learned to output. Evaluating the Target policy on this data will therefore be biased by whatever the Logging policy favors.

For example:

Consider a simple movie recommender for a video streaming service. Imagine that, at each visit, the recommender selects a single movie to suggest to the users after they log in. Assume that we have collected data from this recommender using a stochastic policy whose distribution is weighted towards more popular movies, such as blockbuster superhero movies. When we use these data offline to evaluate new policies, we may mistakenly conclude that policies that favor superhero movies are better, since these movies are over-represented. This is the so-called “rich-get-richer” effect, wherein things that are already very successful become more so due to their ubiquity. At the same time, we may miss opportunities to provide more personalized recommendations due to insufficient data in certain niches. For instance, consider a user whose favorite genre is Scandinavian thrillers. They might also appreciate superhero movies, so recommending them an installment from the Avengers franchise is a safe bet. Yet that would be suboptimal; they would much prefer Midsommar, a horror film set in Sweden.

Another way to put it is to understand that both the actions and the rewards are biased by the production policy.

1. Actions in the logged data are biased because:

Production Policy’s Influence: The production policy (the system currently in use) determines which actions (e.g., recommendations) are made. If this policy has a preference for certain actions (like recommending popular movies), the data collected will be biased towards these actions.
Skewed Data: This means the observed actions in your data are not a random sample but are influenced by the production policy’s preferences. Therefore, some actions are overrepresented while others are underrepresented or not represented at all.

2. Rewards in the logged data are biased because:

Dependent on Actions: The rewards (e.g., clicks, purchases) are outcomes that result from the actions taken. Since the actions themselves are biased, the observed rewards are also biased. Popular items might receive more interactions simply because they are recommended more often, not necessarily because they are inherently better.
Conditional Bias: The rewards are conditional on the biased actions. For instance, the reward distribution for popular items might appear artificially inflated because these items are seen more frequently due to the production policy’s bias.

Solution using Inverse Propensity Weighting

In short, since the logged rewards are produced by the current production model, we can debias them by using the inverse propensity weights of the production model.

Let’s consider an example:

is the contextual features
is the action/recommendation by the policy. For the sake of explanation, let’s say we have 3 possible actions: , ,
is the production policy. In this example, ’s propensity scores for the 3 possible actions are , , and
is the new policy. In this example, ’s propensity scores for the 3 possible actions are , ,
is the total number of samples
is the reward from the action

Case #1:

recommends , and the user likes the recommendation and clicks on it.

The utility score of will be 1 (correct recommendations / total recommendations ). The utility score of will be:

Comparing the utilities, the production policy (utility ) looks better than the new policy (utility ).

Case #2:

recommends , the user clicks it, and also has the highest propensity for at .

The utility score of will be 1. The utility score of will be:

Now the new policy (utility ) looks better than the production policy (utility ).

Case #3:

recommends , but the user doesn’t like it and clicks instead.

The utility score of will be 0 (0 correct / 1 total). The utility score of will be:

Comparing the utilities, the new policy (utility ) looks better than the production policy (utility ).

In summary, the weighting mechanism of IPW helps us understand how a new policy would have performed by explicitly handling the biases that exist in the offline evaluation data produced by the existing policy (e.g. the rich-get-richer effect).

By weighting each observed interaction with the inverse propensity of the production policy, we place more importance on samples that the production policy was unlikely to recommend and less importance on samples it was likely to recommend.

As illustrated in the examples above, this helps us to understand if the new policy performs better than the existing policy.
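To make the weighting concrete, here is a minimal sketch of an inverse-propensity-scored utility estimate: each logged reward is multiplied by the ratio of the new policy's propensity to the production policy's propensity for the logged action, and the results are averaged. The function name, action labels, propensities, and logs are all made up for illustration and are not the numbers from the cases above.

```python
def ips_utility(logs, target_policy):
    """Inverse-propensity-scored estimate of a target policy's utility.

    logs: list of (action, reward, logging_propensity) tuples, where
          logging_propensity is the probability the production policy
          assigned to the action it actually showed.
    target_policy: dict mapping action -> probability under the policy
          being evaluated.
    """
    total = 0.0
    for action, reward, logging_propensity in logs:
        weight = target_policy[action] / logging_propensity
        total += reward * weight
    return total / len(logs)

# Hypothetical propensities over three actions.
production_policy = {"a1": 0.5, "a2": 0.25, "a3": 0.25}
new_policy = {"a1": 0.25, "a2": 0.5, "a3": 0.25}

# Two logged interactions: a1 was shown (no click), then a2 was shown (click).
logs = [
    ("a1", 0, production_policy["a1"]),
    ("a2", 1, production_policy["a2"]),
]

production_utility = ips_utility(logs, production_policy)  # weights are all 1
new_utility = ips_utility(logs, new_policy)                # the click on a2 is up-weighted
print(production_utility, new_utility)
```

The click happened on an action the production policy rarely shows, so the new policy, which favors that action, gets more credit for it.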

This summary captures the main idea behind using Causal Inference techniques in the context of recommendation engines. I have left out several interesting gems presented in the original paper (e.g. counterfactual risk minimization, fairness of recommendation engines, etc.), so have a look at the paper if you are interested in learning more :).


Causal Inference cheatsheet

Sun, 14 Jan 2024 15:00:00 GMT

I keep forgetting which method assumes what, so I put together a single page I can scan before reaching for a technique. It follows the terminology in Matheus Facure’s Causal Inference for the Brave and True, which is the resource I keep coming back to.

It is not exhaustive, and it is not a substitute for reading the book. It is the thing I glance at to remember what “regression discontinuity” actually buys me before I commit to it.

Core concepts

Concept	What it means	Example
Causal Inference	Determining the cause-and-effect relationship between variables.	The impact of a new drug on patient recovery.
Treatment / Intervention	The variable or action being studied for its effect on an outcome.	A new teaching method.
Outcome	The variable that is influenced by the treatment.	Student test scores.
Confounder	A variable that influences both the treatment and the outcome, biasing the estimated effect.	Age in a study linking exercise to heart health.
Randomized Controlled Trial (RCT)	Units are randomly assigned to treatment or control to ensure comparability. See the chapter.	Randomly assigning patients to a drug or a placebo.
Observational Study	You observe the effect of treatments without controlling assignment.	Smoking and lung cancer, studied from existing data.
Counterfactual	What would have happened to the same units under a different treatment.	The unemployment rate had the stimulus not passed.
Selection Bias	Bias from studying subjects who are not representative of the population.	Only healthy volunteers enroll, inflating the drug’s apparent effect.
Instrumental Variables (IV)	A variable that affects treatment but only touches the outcome through it.	Distance to the nearest college as an instrument for education.
Difference-in-Differences (DiD)	Compare outcome changes over time between a treatment and a control group.	A new law’s effect, comparing regions before and after.
Regression Discontinuity (RD)	Use a cutoff to assign treatment, compare units just above and below it.	A scholarship’s effect, comparing students around the eligibility line.
Propensity Score Matching	Match treated and untreated units with similar probability of being treated.	Matching patients on demographics and clinical history before comparing.
Synthetic Control	Build a weighted blend of control units to stand in for the treated one. See the chapter.	A policy’s effect in one country vs a synthetic of several others.
Mediation Analysis	How an intermediate variable carries the effect from cause to outcome.	Stress reduction mediating exercise and mental health.
Natural Experiment	A real-world event that mimics random assignment.	A natural disaster’s effect on economic outcomes.
Heterogeneous Treatment Effects	How the effect varies across subgroups.	Whether a job-training program helps differently by age or education.
Panel Data and Fixed Effects	Repeated observations on the same units to absorb time-invariant confounders.	Education policy, tracked across years of student data.
Synthetic DiD (SDID)	Combines synthetic control and DiD.	A law’s effect, treated region vs synthetic control over time.

The assumptions that actually bite

Most causal estimates fall apart not in the math but in one of these three. They are usually untestable, which is exactly why they are worth writing down.

Assumption	What it means	Where it breaks
Ignorability / Exchangeability	Conditional on observed covariates, treatment is as good as random.	An unmeasured confounder you never recorded.
SUTVA	No interference between units, one version of treatment.	One person’s vaccination affecting another’s outcome.
Common Support / Overlap	Treatment and control overlap in covariate space.	A region where only treated units exist, forcing extrapolation.

Picking a method: the one thing that matters for each

When I reach for one of these, the question is never “what does it do” but “what does it need to be true.” Here is the load-bearing assumption behind each, the thing that, if violated, quietly makes the estimate wrong.

Method	Use it when	The assumption it lives or dies on
RCT	You can assign treatment yourself.	Randomization actually happened (no broken blinding, no attrition).
IV	Treatment is confounded but you have a clean instrument.	The instrument touches the outcome only through treatment. One alternate path kills it.
DiD	You have before/after data for a treated and a control group.	Parallel trends: the two groups would have moved together absent treatment.
RD	Treatment flips at a sharp threshold.	Units can’t precisely manipulate which side of the cutoff they land on.
Propensity Score Matching	Selection is on observables you measured.	You actually observed the confounders. It can’t fix unobserved ones.
Synthetic Control	One treated unit, many candidate controls, long pre-period.	The synthetic blend tracks the treated unit well before treatment.
Mediation	You want the pathway, not just the total effect.	No unmeasured confounding of the mediator-outcome link (the part people forget).
Panel / Fixed Effects	Repeated measures, confounders that don’t change over time.	The confounders really are time-invariant.
SDID	DiD’s parallel-trends looks shaky and you have a panel.	A weaker, reweighted version of parallel trends still holds.
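To make the DiD row concrete, here is a minimal sketch of the 2x2 calculation on group means. The function name and the numbers are made up for illustration, and the estimate only means something if parallel trends holds.

```python
def did_estimate(treated_before, treated_after, control_before, control_after):
    """2x2 difference-in-differences on group means."""
    def mean(xs):
        return sum(xs) / len(xs)

    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    # The control group's change stands in for what the treated group
    # would have done without treatment (the parallel-trends assumption).
    return treated_change - control_change

# Made-up outcomes: treated group moves 11 -> 16, control group moves 9 -> 10.
effect = did_estimate(
    treated_before=[10, 12],
    treated_after=[15, 17],
    control_before=[8, 10],
    control_after=[9, 11],
)
print(effect)
```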

Things I’ve learned to check before trusting a result

Plot the overlap. If treated and control don’t share covariate space, the model is extrapolating and nobody told you.
Stress the key assumption, not the data. More features won’t save a design whose identification is broken.
Run a sensitivity analysis. “How big would an unobserved confounder have to be to flip this?” is often more honest than the point estimate.
Be suspicious of a clean answer from messy observational data. It usually means an assumption is doing more work than you realize.


Relationship of covariance and dot product

Mon, 31 Jan 2022 15:00:00 GMT

The Relationship Between Covariance and Dot Product

Covariance and the dot product are related concepts in mathematics and statistics, particularly in the context of vectors and random variables. Here’s how they are connected:

Dot Product

The dot product (also known as the scalar product) of two vectors $\mathbf{a}$ and $\mathbf{b}$ in $n$-dimensional space is given by:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

This operation results in a single scalar value and provides a measure of the similarity between the two vectors. If the vectors point in the same direction, the dot product is positive and large; if they point in opposite directions, the dot product is negative; if they are orthogonal, the dot product is zero.

Covariance

Covariance is a measure of how much two random variables $X$ and $Y$ change together. It is given by:

$$\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Where $E[\cdot]$ denotes the expected value. If the covariance is positive, the variables tend to increase together; if it is negative, one tends to increase when the other decreases.

Relationship

The relationship between covariance and the dot product can be seen in the context of centered random variables. If you consider the centered random variables:

$$\tilde{X} = X - E[X], \qquad \tilde{Y} = Y - E[Y]$$

the covariance between $X$ and $Y$ can be interpreted as the dot product of these centered variables in the space of random variables, normalized by the number of observations $n$:

$$\operatorname{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Here, $x_i - \bar{x}$ and $y_i - \bar{y}$ are the deviations from the mean of the random variables $X$ and $Y$ respectively. This sum is analogous to the dot product, where the sum of the products of corresponding elements gives a measure of the overall relationship between the two sets of values.

Summary

In summary, covariance can be viewed as a normalized dot product of centered random variables. Both operations measure how two sets of values relate to each other, with the dot product being a geometric measure in vector space and covariance being a statistical measure in the space of random variables.
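A quick way to convince yourself of this is to compute it by hand. The sketch below uses plain Python and made-up numbers. It divides by the number of observations, which gives the population covariance; dividing by one less than that would give the sample covariance instead.

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def covariance(xs, ys):
    """Population covariance: dot product of the centered values divided by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    centered_x = [x - mean_x for x in xs]
    centered_y = [y - mean_y for y in ys]
    return dot(centered_x, centered_y) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
cov_xy = covariance(xs, ys)
print(cov_xy)
```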

Deep dive into MLOps.

Thu, 13 May 2021 15:00:00 GMT

Motivation

Machine Learning (ML) Proof Of Concept (POC) is one of the key phases in ML projects. Iteratively coming up with a better modeling approach, improving data quality, and finally, delivering a minimum viable solution for a grand problem is fascinating and you learn a lot.

At the end of the POC phase, if the results are satisfactory and stakeholders agree to move forward, the ML engineer gets tasked with finding and implementing an appropriate method for deploying the model into production. Depending on the business use case and the scale of the project (and many other factors), the engineer will usually proceed with one of the following (non-exhaustive):

Type 1: Save the trained model artifacts in the backend. Implement an inference logic that uses the trained model. The frontend engineer then requests the predictions from the backend via HTTP.
Type 2: Choose to use one of the cloud services (Amazon AWS, Google GCP, Microsoft Azure) for training the model. Transfer the training data to the cloud, train the model and deploy it in the cloud. Set up a cloud endpoint and return predictions via HTTP requests.
Type 3 (worst): Implement the training and the inference logic in the backend. At the same time, implement the data extraction, transformation, and loading (ETL) processes in the backend so that the system is able to retrain the model when needed. Try to set up monitoring, so that one day when things go wrong, the problem can be found and fixed.
Type 4: Use the cloud services to set up everything (ETL, data preprocessing, model training, monitoring, configuration, etc.). The frontend engineer then requests predictions via an HTTP endpoint provided by the cloud service.
Type 5: Use open-source packages, connect and build everything manually.

There can be many other options too. However, the key message here is that ML deployment can take on many forms.

With ever-increasing data and growing expertise among the people dealing with it, ML models are being trained, tested, and deployed into production at unprecedented rates. However, with the increasing number of model deployments, ML practitioners have been experiencing a set of problems related to the unique property of ML-embedded systems: the usage of data. And this is exactly where MLOps comes to the rescue. Similar to the concept/culture of DevOps in the traditional software engineering discipline, the term MLOps refers to the engineering culture and best practices related to deploying and maintaining real-world ML systems.

There are many resources related to MLOps. Some of my favorites are:

“Hidden Technical Debt in Machine Learning Systems” - Google (Analyzed in this blog post.)
“Machine Learning Guides” - Google
“MLOps for non-DevOps folks, a.k.a. ‘I have a model, now what?!’” - Hannes Hapke
“From Model-centric to Data-centric AI” - Andrew Ng
“Bridging AI’s Proof-of-Concept to Production Gap” - Andrew Ng
“MLOps: Continuous delivery and automation pipelines in machine learning” - Google
“What is ML Ops? Best Practices for DevOps for ML” - Kaz Sato
…and so on.

Perhaps the most popular resource on the above list is the paper called “Hidden Technical Debt in Machine Learning Systems” by Google.

When I first read the paper, I found it highly useful in practice and realized that I should revisit it whenever I am building ML solutions. So to make my life a bit easier, I decided to summarize the paper using simple terms and explanations.

Key points:

Developing and deploying ML systems might seem fast and cheap, but maintaining them over the long term is difficult and expensive.
The goal of paying down technical debt is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability.
Data influences ML system behavior. Technical debt in ML systems may be difficult to detect because it exists at the system level rather than the code level.

Complex Models Erode Boundaries

1. Entanglement

CACE principle:
CACE stands for Changing Anything Changes Everything. It refers to the entangled nature of ML systems: changing the input distribution of a single feature can change the importance, weights, or use of the remaining features. CACE applies to input signals, hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
Possible solution #1:
If the problem can be decomposed into sub-problems (disjoint, uncorrelated), train models for each sub-problem and serve ensemble models. In many cases ensembles work well because the errors in the component models are uncorrelated. However, ensemble models lead to strong entanglement: improving an individual model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.
Possible solution #2:
Monitor and detect changes in the prediction behavior in real-time. Use visualization tools to analyze the changes.

Example of feature entanglement: complex correlation of features.

2. Correction Cascades (chain reaction)

When we have a ready-to-use model $m_a$ for problem $A$ and need a solution for a slightly different problem $A'$, it is tempting to train a correction model $m'_a$ that takes the predictions of the model $m_a$ as input.
This dependency can be cascaded further, for example by training a model $m''_a$ on top of the model $m'_a$.
This dependency structure creates an improvement deadlock. When we try to improve the accuracy of a single model, other models are affected by the change, causing system-level issues.
Possible solution #1:
Train and tune the model $m_a$ directly for the problem $A'$ by adding appropriate features specific to the problem $A'$.
Possible solution #2:
Create a new model for the problem $A'$.

Example of correction cascades.

3. Undeclared Consumers (Please let us know if you are using!)

Often, predictions of a model are used in other services.
If we don’t identify all consumers of a model, later changes to the model will silently affect the undeclared consumers, who will face strange issues. Without a clear definition of consumers, it becomes difficult and costly to make any changes to the model at all.
In practice, engineers will choose to use the most accessible signal at hand (e.g. model predictions), especially when deadlines are approaching.
Possible solution:
Set up access restrictions and determine all consumers. Setting up service-level agreements (SLAs) would also work.

Example of an undeclared consumer.

Data Dependencies Cost More than Code Dependencies

1. Unstable Data Dependencies (Can I rely on you?)

ML-systems often consume data (input signals) produced by other systems.
Some data might be unstable (qualitatively or quantitatively changing over time).
For example:
1. Input data to system B is produced by a machine learning model from system A, and system A decides to update its model.
2. Input data comes from a data-dependent lookup table (e.g. one used to calculate TF-IDF scores or semantic mappings).
3. Engineering ownership of the input signal is separate from the engineering ownership of the model that consumes it. Engineers who own the input signal can make changes to the data at any time. This makes it costly for the engineers who consume the data to analyze how the change affects their system.
4. Corrections in the input data can lead to detrimental consequences too! Similar to the CACE principle.
  For example:
  A model that is previously trained on mis-calibrated input signals can start to behave strangely when the input data is recalibrated. This means that although the update was meant to fix a problem, it actually introduced complications to the system.
Possible solution:
Create a versioned copy of the input data. Using a stable version of the data until the new data is checked and tested keeps the system stable. Keep in mind that saving versioned copies of the data means we have to deal with potential data staleness and the cost of maintenance.

Example of data dependency between systems.

2. Underutilized Data Dependencies (How will this feature affect me in the future?)

Underutilized data dependency - data that provides very small incremental benefit to the model (e.g. a 0.0001% improvement).
Example:
A team introduces new product numbering logic. Since instantly removing the old product numbering logic will lead to disasters (other components depend on it), engineers decide to keep both the new and the old numbering logic for the time being. This way, new products use the new product numbering logic only, but old products have both the new and the old product numbers. Accordingly, the model is retrained using both features, continuing to rely on old product numbers for some products. A year later, when the old numbering logic code is deleted, it will be difficult for the maintainers to find what went wrong.
Types of underutilized data dependency:
a) Legacy Features.
A feature F is included in a model early in its development. Over time, F is made redundant by new features but this goes undetected.
b) Bundled Features.
During an experiment, a group of features is evaluated and found to be useful. Because of deadline pressures, the whole group is added to the model together, possibly including features with low value.
c) ε-Features. As machine learning researchers, it is tempting to improve the model accuracy even when the accuracy gain is small and the complexity overhead is high.
d) Correlated Features. Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. When the correlations disappear in the future, the model will perform poorly.
Possible solution:
Leave-one-feature-out evaluations can be used to detect underutilized dependencies. Regularly performing this evaluation is recommended.
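A leave-one-feature-out check can be sketched as a small loop around whatever training and evaluation routine you already have. In the sketch below, `evaluate` is a hypothetical callable that trains on the given feature names and returns a validation score. The toy version, the feature names, the contributions, and the 0.001 cutoff are all made up for illustration.

```python
def leave_one_feature_out(features, evaluate):
    """Return {feature: drop in score when that feature is removed}.

    `evaluate` is a hypothetical callable that trains a model on the given
    feature names and returns a validation score (higher is better).
    """
    baseline = evaluate(features)
    drops = {}
    for feature in features:
        remaining = [f for f in features if f != feature]
        drops[feature] = baseline - evaluate(remaining)
    return drops

# Stand-in for real training: each feature adds a fixed, made-up amount of accuracy.
TOY_CONTRIBUTION = {"price": 0.10, "category": 0.05, "legacy_product_id": 0.0}

def toy_evaluate(features):
    return 0.70 + sum(TOY_CONTRIBUTION[f] for f in features)

drops = leave_one_feature_out(list(TOY_CONTRIBUTION), toy_evaluate)
# Flag features whose removal barely changes the score (the cutoff is arbitrary).
underutilized = [f for f, drop in drops.items() if drop < 0.001]
print(underutilized)
```

With a real model each call to `evaluate` is a full retrain, so this is something to schedule periodically rather than run on every change.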

Example of underutilized data dependency. Feature J is an old feature similar to Feature K. In the future, when we remove Feature J, modules that depend on it will fail silently.

3. Static Analysis of Data Dependencies (Data Catalogs!)

In traditional code, compilers and build systems perform static analysis of dependency graphs. Tools for static analysis of data dependencies are far less common, but are essential for error checking, tracking down consumers, and enforcing migration and updates.
Annotating data sources and features with metadata (e.g. deprecation, platform-specific availability, and domain-specific applicability), and setting up automatic checks to make sure all annotations are updated helps to resolve the dependency tree (users and systems who use the data).
This kind of tooling can make migration and deletion much safer in practice.
Google’s solution can be found here (Section 8: “AUTOMATED FEATURE MANAGEMENT”).

Example of data dependency management.

Feedback Loops

When it comes to live ML systems that update their behavior over time, it is difficult to predict how they will behave once they are released into production. This is called analysis debt: the difficulty of measuring the behavior of a model before deployment.

1. Direct Feedback Loops (Learning wrong things)

A model may directly influence the selection of its own future training data.
Example:
In an e-commerce website a model might recommend 10 different products to a new customer. The customer might choose a product category that he/she needs to purchase at the moment, but which is not related to his/her actual interests. The model captures this and retrains itself, learning the wrong interest and becoming confident that the chosen product category conveys the customer’s interest.
Possible Solution #1:
Acquiring customer feedback on the recommendations provided by the system.
Possible Solution #2:
Increasing “exploration” of the model so that it doesn’t “exploit” a small number of signals (Bandit Algorithms).
Possible Solution #3:
Using some amount of randomization.
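One simple way to combine the exploration and randomization ideas above is an epsilon-greedy rule: most of the time show the top-scored item, and occasionally show a random one so the model keeps collecting data outside its current beliefs. This is a minimal sketch, not what the paper prescribes. The function name, categories, scores, and epsilon value are assumptions.

```python
import random

def epsilon_greedy(scores, epsilon, rng=random):
    """Pick the top-scored item with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return rng.choice(list(scores))
    return max(scores, key=scores.get)

# Made-up model scores per product category.
scores = {"shoes": 0.9, "books": 0.4, "garden": 0.1}

rng = random.Random(0)  # seeded so the run is repeatable
picks = [epsilon_greedy(scores, epsilon=0.2, rng=rng) for _ in range(1000)]
print({item: picks.count(item) for item in scores})
```

The top category still dominates, but the other categories keep getting a small share of impressions, which gives the next retraining run some evidence about them.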

2. Hidden Feedback Loops (We were connected???)

Hidden Feedback Loops - two systems influence each other indirectly (in a hidden, difficult-to-detect manner).
Difficult to detect!
Improving one system may lead to changes in the behavior of the other.
Hidden feedback loops can also occur in completely disjoint systems.
Example:
Two investment firms trade in the same market. One firm changes its bidding algorithm (improvements, bugs, etc.), and the other firm’s model picks up this signal and changes its behavior too, which might lead to disasters.

ML-System Anti-Patterns

In the real world, ML-related code takes a very small fraction of the entire system. Therefore, it is important to consider other parts of the system and make sure that there are as few system-design anti-patterns as possible.

Figure 1 in “Hidden Technical Debt in Machine Learning Systems”, Google.

1. Glue Code (Packages should be replaceable –> APIs)

Glue code system design - a system with a large portion of code written to support the specific requirements of general-purpose packages.
Glue code makes it difficult to test alternative methods or make improvements in the future.
If ML code takes 5% and glue code takes 95% of the total code, it may be less costly to implement a native solution than to keep using general-purpose packages.
Possible Solution:
Wrap general-purpose packages into common APIs. This way we can reuse the APIs and not worry about changing packages (e.g. scikit-learn’s Estimator API with common fit(), predict() methods for all models).
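The wrapping idea can be sketched as follows: the rest of the system only depends on `fit()` and `predict()`, and a thin adapter translates those calls to whatever the package actually exposes. Everything here is hypothetical, including `ToyBackend` and its `train()`/`score_rows()` methods, which stand in for a real general-purpose package.

```python
from typing import List, Protocol, Sequence

class Model(Protocol):
    """The only interface the rest of the system is allowed to depend on."""
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[float]) -> "Model": ...
    def predict(self, X: Sequence[Sequence[float]]) -> List[float]: ...

class MeanBaseline:
    """Toy native model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

class ToyBackend:
    """Stand-in for a general-purpose package with its own method names (hypothetical)."""
    def train(self, rows, targets):
        self.last_ = targets[-1]

    def score_rows(self, rows):
        return (self.last_ for _ in rows)

class PackageAdapter:
    """Hides the package-specific calls behind fit()/predict()."""
    def __init__(self, backend):
        self.backend = backend

    def fit(self, X, y):
        self.backend.train(X, y)
        return self

    def predict(self, X):
        return list(self.backend.score_rows(X))

def train_and_predict(model: Model, X_train, y_train, X_new):
    # Caller code never mentions the underlying package.
    return model.fit(X_train, y_train).predict(X_new)

X, y, X_new = [[1.0], [2.0]], [2.0, 4.0], [[3.0], [4.0]]
for model in (MeanBaseline(), PackageAdapter(ToyBackend())):
    print(type(model).__name__, train_and_predict(model, X, y, X_new))
```

Swapping the package then means writing one new adapter instead of touching every caller.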

2. Pipeline Jungles (Keep things organized)

Usually occurs during data preparation.
Adding new data sources to the data pipeline one by one gradually makes it difficult to make improvements and detect errors.
Possible Solution #1:
Think about data collection and feature extraction as a whole. Redesign the jungle pipeline from scratch. It might require a lot of effort, but is an investment worth the price.
Possible Solution #2:
Make sure researchers and engineers are not overly separated. Packages written by the researchers may seem like black-box algorithms to the engineers who use them. If possible, integrate researchers and engineers into the same team (or create hybrid engineers).

Icons from thenounproject.com. Integrating researchers and engineers into one team.

3. Dead Experimental Codepaths (Clean your room!)

Adding experimental code to the production system can cause issues in the long term.
It becomes difficult/impossible to test all possible interactions between codepaths. The system becomes overly complex with many experimental branches.
“A famous example of the dangers here was Knight Capital’s system losing $465 million in 45 minutes, because of unexpected behavior from obsolete experimental codepaths.” (source)
Possible Solution:
“Periodically reexamine each experimental branch to see what can be ripped out. Often only a small subset of the possible branches is actually used; many others may have been tested once and abandoned.”

Example of dead code.

4. Abstraction Depth (Can we abstract this?)

The relational database is a successful example of abstraction.
In ML it is difficult to come up with a robust abstraction (how do we create abstractions for a data stream, a model, and a prediction?).
Example abstraction:
Parameter-server abstraction - framework for distributed ML problems. “Data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.”

5. Common Smells (Something’s fishy…)

“Code Smells are patterns of code that suggest there might be a problem, that there might be a better way of writing the code or that more design perhaps should go into it.” (source)
Plain-Old-Data Type Smells.
Check what flows in and out of ML systems. The data (signal, features, information, etc.) flowing in and out of ML systems is usually of type integer or float. “In a robust system, a model parameter should know if it is a log-odds multiplier or a decision threshold, and a prediction should know various pieces of information about the model that produced it and how it should be consumed.”
Multiple-Language Smell.
Using multiple programming languages, however good the packages available in each of them, makes it difficult to test efficiently and, later, to transfer ownership to others.
Prototype Smell.
Constantly using prototype environments is an indication that the production system is “brittle, difficult to change, or could benefit from improved abstractions and interfaces”.
This leads to 2 potential problems:
- Usage of the prototype environment as the production environment due to time pressures.
- Prototype (small scale) solutions rarely reflect the reality of full-scale systems.
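For the Plain-Old-Data Type smell above, one way to let a prediction “know” where it came from is to pass small typed objects around instead of bare floats. This is a minimal sketch under my own assumptions: the class names, fields, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Threshold:
    value: float
    metric: str  # the metric this threshold was tuned for

@dataclass(frozen=True)
class Prediction:
    score: float        # predicted probability in [0, 1]
    model_name: str
    model_version: str

    def __post_init__(self):
        # Reject values that cannot be probabilities (e.g. raw log-odds).
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score must be a probability, got {self.score}")

    def is_positive(self, threshold: Threshold) -> bool:
        return self.score >= threshold.value

pred = Prediction(score=0.83, model_name="churn", model_version="v1")
label = pred.is_positive(Threshold(value=0.5, metric="f1"))
print(pred, label)
```

A consumer that receives a `Prediction` can log which model version produced it, and passing a log-odds value where a probability is expected fails loudly instead of silently.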

Configuration Debt

ML systems come with various configuration options:
1. Choice of input features
2. Algorithm-specific learning configurations (hyperparameters, number of nodes, layers, etc.)
3. Pre- & post-processing methods
4. Verification methods

In full-scale systems, the number of lines of configuration can far exceed the number of lines of traditional code. Mistakes in configuration can lead to loss of time, waste of computing resources, and other production issues. Thus, verifying and testing configurations is crucial.

Example of difficult-to-handle configurations (Choice of input features):
1. Feature A was incorrectly logged from 9/13-9/17.
2. Feature B isn’t available before 10/7.
3. Feature C has to have 2 different acquisition methods due to changes in logging.
4. Feature D isn’t available in production. Substitute feature D’ must be used.
5. Feature Z requires extra training memory for the model to train efficiently.
6. Feature Q has high latency, so it makes Feature R also unusable.

Configurations must be:
1. Easy to change. A change should feel like a small modification of the previous configuration.
2. Hard to get wrong. Manual errors, omissions, and oversights should be difficult to make.
3. Easy to analyze and compare with the previous version.
4. Easy to check (basic facts and details).
5. Easy to detect unused or redundant settings.
6. Code reviewed and maintained in a repository.
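
As a rough illustration of points 2-4 above, the sketch below keeps a configuration in a small typed object that can validate itself and be compared with the previous version. The “TrainingConfig” class, its fields, and the feature names are hypothetical and only serve the example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical training configuration; field names are illustrative."""
    features: tuple
    learning_rate: float = 0.01
    n_layers: int = 2

    def validate(self, known_features):
        """Return a list of human-readable problems (empty if the config is OK)."""
        problems = []
        unknown = set(self.features) - set(known_features)
        if unknown:
            problems.append(f"unknown features: {sorted(unknown)}")
        if len(set(self.features)) != len(self.features):
            problems.append("duplicate features listed")
        if not 0 < self.learning_rate < 1:
            problems.append("learning_rate must be in (0, 1)")
        if self.n_layers < 1:
            problems.append("n_layers must be at least 1")
        return problems


def diff_configs(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k]) for k in old_d if old_d[k] != new_d[k]}


known = {"feature_a", "feature_b", "feature_d_prime"}
previous = TrainingConfig(features=("feature_a", "feature_b"))
current = TrainingConfig(features=("feature_a", "feature_d"), learning_rate=0.05)

print(current.validate(known))          # reports the feature that is not available
print(diff_configs(previous, current))  # shows exactly which fields changed
```

Running the validation in CI and attaching the diff to the code review covers the “easy to check” and “easy to compare” requirements with very little code.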

Dealing with the Changes in the External World

ML systems often directly interact with the ever-changing real world. Due to volatility in the real world, ML systems require continuous maintenance.

1. Fixed Thresholds in Dynamic Systems

Problems that predict whether a given sample is true or not require a decision threshold.
The classic approach is to choose the decision threshold that maximizes a certain metric (precision, recall, etc.).
When a model is retrained on new data, the previous decision threshold usually becomes invalid.
Manually recalculating thresholds is time consuming, so an automatic optimization method is needed.
A good approach is to set aside a held-out validation set and use it to calculate the thresholds automatically.
The approach used by Google data scientists can be found in the “Unofficial Google Data Science” blog post.
My simple approach to dealing with decision thresholds - blog post.
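
The held-out validation idea can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names, the toy labels and probabilities, and the choice of balanced accuracy as the metric are my assumptions, not the method from either blog post.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def pick_threshold(y_val, probs_val, candidates):
    """Return the candidate threshold with the best score on the validation data."""
    def score(threshold):
        preds = [1 if p >= threshold else 0 for p in probs_val]
        return balanced_accuracy(y_val, preds)
    # max() keeps the first candidate in case of ties.
    return max(candidates, key=score)


# Toy validation set: true labels and the model's predicted probabilities.
y_val = [0, 0, 0, 0, 1, 1, 1, 1]
probs_val = [0.05, 0.10, 0.20, 0.35, 0.30, 0.45, 0.60, 0.90]
candidates = [i / 100 for i in range(10, 100, 5)]

best = pick_threshold(y_val, probs_val, candidates)
print(best)
```

Because the function only needs labels and probabilities, it can be re-run automatically after every retraining, which is the whole point of not fixing the threshold by hand.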

2. Monitoring and Testing

Unit tests and end-to-end tests of systems are valuable. However, in the real world, these tests are simply not enough.
Real-time monitoring with automated responses to issues is essential for long-term system reliability.
What should be monitored:
a) Prediction Bias
Is the distribution of predicted labels equal to the distribution of observed labels? Although this test cannot detect a model that simply outputs the average label frequency regardless of the input features, it is surprisingly effective.
For example:
- Real-world behavior changes, and the training data distribution no longer reflects reality. In this case, slicing predictions along various dimensions (e.g. race, gender, age, etc.) can isolate biases. Further, we can set up automated alerting in case we detect prediction bias.

b) Action Limits
In real-world ML systems that take actions such as bidding on items or classifying messages as spam, it is useful to set action limits as a sanity check. “If the system hits a limit for a given action, automated alerts should fire and trigger manual intervention or investigation.”
For example:
- Setting a maximum number of bids per hour
- Setting the maximum ratio of spam to regular emails to 3/5

c) Up-Stream Producers
When data comes from up-stream producers, those producers need to be monitored and tested, and must routinely meet objectives that take the downstream ML system into account. Most importantly, any up-stream issue must be propagated to the downstream ML system, and the ML system must in turn notify its own downstream consumers.

Issues that occur in real time and affect the system should be handled by automatic measures. Human intervention can work, but not when the issue is time-sensitive. Automatic measures are worth the investment.
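
Two of the monitors above, prediction bias and action limits, are simple enough to sketch. The function and class names, the alert tolerance of 0.05, and the limit of 3 actions are arbitrary assumptions made for the example.

```python
def prediction_bias(predicted_probs, observed_labels):
    """Mean predicted positive rate minus the observed positive rate."""
    mean_pred = sum(predicted_probs) / len(predicted_probs)
    mean_obs = sum(observed_labels) / len(observed_labels)
    return mean_pred - mean_obs


class ActionLimiter:
    """Counts actions in the current window and refuses once the limit is hit."""

    def __init__(self, max_actions):
        self.max_actions = max_actions
        self.count = 0

    def allow(self):
        """Return True if the action may proceed, False if the limit is reached."""
        if self.count >= self.max_actions:
            return False  # the caller should alert and stop acting
        self.count += 1
        return True


# The model predicts ~20% positives, but 50% of the observed labels are positive.
bias = prediction_bias([0.1, 0.2, 0.3, 0.2], [0, 1, 1, 0])
alert_on_bias = abs(bias) > 0.05  # tolerance is an arbitrary assumption

limiter = ActionLimiter(max_actions=3)  # e.g. at most 3 bids per window
decisions = [limiter.allow() for _ in range(5)]
print(alert_on_bias, decisions)
```

In a real system the bias would be computed per slice and per time window, and a refused action would page someone rather than just return False.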

Conclusion

Speed is not evidence of low technical debt or good practices because the real cost of debt becomes apparent over time.
Useful questions for detecting ML technical debt:
- How easily can an entirely new algorithmic approach be tested at full scale?
- What is the transitive closure of all data dependencies? Transitive closure gives you the set of all places you can get to, from any starting place.
- How precisely can the impact of a new change to the system be measured?
- Does improving one model or signal degrade others?
- How quickly can new members of the team be brought up to speed?
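
For the transitive closure question, a small sketch may help: given a map from each component to its direct data dependencies, collect everything that is reachable. The dependency graph below is entirely made up for illustration.

```python
def transitive_closure(graph, start):
    """Return every node reachable from `start` by following dependency edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


# Hypothetical dependency graph: each key depends on the listed upstream items.
deps = {
    "ranking_model": ["ctr_feature", "user_embedding"],
    "ctr_feature": ["click_logs"],
    "user_embedding": ["profile_table", "click_logs"],
}
print(sorted(transitive_closure(deps, "ranking_model")))
```

Here the model directly depends on two signals, but a change to “click_logs” or “profile_table” can still reach it, which is exactly the kind of hidden dependency the question is meant to surface.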
Key areas that will help reduce technical debt:
- Maintainable ML
- Better abstractions
- Testing methodologies
- Design patterns
Engineers and researchers both need to be aware of technical debt. “Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice.”
Dealing with technical debt can only be achieved by a change in team culture. “Recognizing, prioritizing, and rewarding this effort is important for the long term health of successful ML teams.”

And this is it!

In this post, I have tried to summarize the key points of the prominent paper “Hidden Technical Debt in Machine Learning Systems” by Google.

“The Thinker” - by Avery Evans, unsplash.com

Data engineering: simple and complex data pipelines

Mon, 12 Apr 2021 15:00:00 GMT

Most of my previous work consisted of various data analysis and ML-related tasks.

Recently, I have been working on tasks related to data engineering, so I decided to learn more about it. I stumbled upon Chris Riccomini’s talk at QCon San Francisco and learned quite a few terms and concepts.

In this post, I would like to summarize the key points from Chris’ talk.
All credits go to Chris Riccomini (Link to the talk, Linkedin, Twitter).

What is the role of a data engineer?
- A data engineer’s job is to help an organization move (streaming or data pipelines) and process (data warehouses (DWH)) data.

Data engineers build tools, infrastructure, frameworks and services.

Data engineering is much closer to software engineering than it is to data science.

- “The rise of the data engineer”, Maxime Beauchemin, Preset.

Data engineers are primarily involved with building data pipelines.

There are 6 stages in an organization’s data pipeline:
1. None
2. Batch
3. Realtime
4. Integration
5. Automation
6. Decentralization

1. None (Monolith DB)

Structure:
1. Single large DB
2. Users access the same DB

Pros/cons:

Pros:
- Simple
Cons:
- Queries time out
- Users impact each other
- MySQL doesn’t have complex SQL functions
- Report generation breaks

2. Batch (DWH + Scheduler)

Structure:
1. In-between the user and the DB we put a DWH
2. To get data from the DB to the DWH, we put a scheduler to periodically suck the data in.

Pros/cons:

Pros:
- Setup is quick
- Best for a basic setup
Cons:
- A large number of Airflow jobs is difficult to maintain
- create_time, modify_time issues arise
- DB Admin’s operations impact the pipeline
- Hard deletes don’t propagate
- MySQL replication latency (the time it takes for a transaction that occurs in the primary database to be applied to the replica database) causes data quality issues
- Periodic loads cause occasional MySQL timeouts
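
The “modify_time” and hard-delete problems are easy to reproduce with a toy example. The sketch below is my own simplification, with Python dicts standing in for the source DB and the DWH and a hypothetical “incremental_load” function standing in for a scheduled job.

```python
def incremental_load(source_rows, warehouse, last_sync_time):
    """Copy rows whose modify_time is newer than the last sync into the warehouse."""
    for row in source_rows:
        if row["modify_time"] > last_sync_time:
            warehouse[row["id"]] = dict(row)
    return warehouse


# First scheduled load: the source DB holds two rows.
source = [
    {"id": 1, "name": "alice", "modify_time": 5},
    {"id": 2, "name": "bob", "modify_time": 8},
]
warehouse = incremental_load(source, {}, last_sync_time=0)

# Between loads: row 1 is updated and row 2 is hard-deleted from the source.
source = [{"id": 1, "name": "alice_v2", "modify_time": 12}]
warehouse = incremental_load(source, warehouse, last_sync_time=10)

print(warehouse[1]["name"])  # the update arrived
print(2 in warehouse)        # the deleted row is still in the warehouse
```

A periodic load only sees rows that still exist and carry a newer timestamp, so a deleted row silently lives on in the DWH.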

Transition to Realtime if:
1. Loads are taking too long
2. Pipelines are no longer stable
3. Many complicated workflows
4. Data latency (the time it takes for data to travel from one place to another) is becoming an issue
5. Data engineering is your full-time job
6. Your organization uses Apache Kafka (a stream processing platform that provides unified, high-throughput, low-latency handling of real-time data feeds)

3. Realtime (Kafka)

Structure:
1. Replace Airflow with Debezium (a tool for change data capture: start it up, point it at your data sources, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases).

Change Data Capture is like GitHub for DB changes.

Debezium data sources: MongoDB, MySQL, PostgreSQL, SQL Server, Oracle, Cassandra

2. Operational complexity has gone up

3. KCBQ - Kafka Connect BigQuery connector (takes data from Kafka and uploads it into BigQuery)
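
The change data capture idea from step 1 can be sketched as replaying a stream of change events onto a copy of the table. The event format below is a simplified assumption for illustration and is not Debezium’s actual message format.

```python
def apply_change(replica, event):
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = event["op"]
    if op in ("insert", "update"):
        replica[event["key"]] = event["row"]
    elif op == "delete":
        replica.pop(event["key"], None)  # hard deletes are events too
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica


# Hypothetical stream of change events, in commit order.
events = [
    {"op": "insert", "key": 1, "row": {"name": "alice"}},
    {"op": "insert", "key": 2, "row": {"name": "bob"}},
    {"op": "update", "key": 1, "row": {"name": "alice_v2"}},
    {"op": "delete", "key": 2},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```

Because every insert, update, and delete arrives as its own event, the downstream copy stays in sync without periodic full or incremental loads.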

Transition to Integration if:
1. You have many microservices
2. You have a diverse DB ecosystem
3. You have a team of data engineers
4. You have a mature SRE organization (SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks)

4. Integration (Advanced topic)

Structure:
1. Services with DB
2. Streaming platform (Kafka) and DWH
3. Different types of DBs (NoSQL, NewSQL, GraphDB…)

Why do we need such a complex pipeline?

Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system (the value of a network increases with more nodes and edges you add into it).

Two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections.
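
The numbers in the telephone example follow from counting distinct pairs, n(n-1)/2. A tiny sketch (the function name is mine):

```python
def possible_connections(n_nodes):
    """Number of distinct pairwise connections among n_nodes: n * (n - 1) / 2."""
    return n_nodes * (n_nodes - 1) // 2


for n in (2, 5, 12):
    print(n, possible_connections(n))  # 2 -> 1, 5 -> 10, 12 -> 66
```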

Pros/cons:

Pros:
- If you would like to test a new realtime system, it is relatively easy to do so because your data is very portable.
- Easy to switch Cloud vendors
- Improves infrastructure agility. It is easy to plug in a new system, feed it data, and test it.
Cons:
- Add/create/configure/grant/deploy manual work persists!
- Manual work eats up time…

Transition to Automation if:
1. Your SREs can’t keep up
2. Manual work is taking a lot of time

5. Automation

Structure:
1. Automated Data Management added
2. Automated Operations added

a) Automated Data Management

Automation helps with data management:
1. Who gets access to the data once it is loaded?
2. How long can the data persist (e.g. keep it for only 3 years, then remove it)?
3. Is this data allowed in this system (sensitive information)?
4. Which geographies must the data persist in?
5. Should columns be masked, redacted?

One of the most tedious tasks of data management is creating data catalogs.

Contents of a data catalog:
- Location of the data
- Data schema information
- Who owns the data
- Lineage (where the data came from)
- Encryption information (which parts of the data are masked, encrypted)
- Versioning information

Example data catalog by Lyft’s Amundsen tool:

The key point here is that we don’t want to manually input data into the data catalog.

Instead, we should be hooking up our systems to different data catalog generators since they can automatically generate the metadata (schema, ownership, evolution, etc.).

b) Automated Operations:

User management automations:
1. New user access
2. New data access
3. Service account access
4. Temporary access
5. Unused access

Detecting violations via automations:
1. Auditing
2. Data loss prevention (GCP Data Loss Prevention (DLP))

For example, we can run DLP checks to detect whether sensitive information (phone number, SSN, email, etc.) exists in the data or not. This protects us from violating regulations.
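
To give a feel for what such a check does, here is a deliberately naive sketch that scans rows with two regular expressions. It does not use the GCP DLP API, and the patterns, info type names, and sample rows are assumptions made for the example. A real DLP service detects far more types with far better accuracy.

```python
import re

# Deliberately simple patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_rows(rows):
    """Return a list of (row_index, column, info_type) for every match found."""
    findings = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            for info_type, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    findings.append((i, column, info_type))
    return findings


rows = [
    {"comment": "call me later", "contact": "jane@example.com"},
    {"comment": "ssn is 123-45-6789", "contact": "n/a"},
]
print(scan_rows(rows))
```

Running a scan like this automatically on every newly loaded table is what turns a compliance rule into an automated operation.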

Even after automating all of the above, data engineers still have to configure and deploy.

Transition to Decentralization if:
1. You have a fully automated realtime data pipeline
2. People still ask the data engineers to load some data

6. Decentralization

Structure:
1. Multiple DWHs
2. Different groups administer and manage their own DWH
3. From monolith to micro-warehouses

What is likely to be considered a full data pipeline Decentralization?
1. Polished tools are exposed to everyone
2. Security and compliance manage the access and policies
3. Data engineers manage data tools and infrastructure
4. Everyone manages data pipelines and DWHs

Conclusion:

Modern Data Pipeline structures by Chris Riccomini:
1. Realtime data integration
2. Streaming platforms
3. Automated data management
4. Automated operations
5. Decentralized DWHs and pipelines

As a final remark, I would like to say that I truly enjoyed Chris’ speech.

It made me truly appreciate all the hard work put in by data engineers, not to mention the complexity of all the bits and pieces.

For anyone who is reading this post, I highly recommend watching Chris’ talk.

Cheers and stay safe!

Takeaways from Kaggle’s “Jane Street Market Prediction” competition

Fri, 12 Mar 2021 15:00:00 GMT

Recently, I have spent my evenings participating in Kaggle’s “Jane Street Market Prediction” competition.

To preserve the know-how acquired from the competition, I have written this blog post.

If you would like to learn more about the competition click here.

Anonymized data set know-how
Cross-validation know-how
Neural network optimization with Keras-tuner
Keras fast inference know-how

Anonymized data set know-how

Since we have been provided with an anonymized data set, it was difficult to perform EDA.

The specific data engineering methodology of Jane Street is one of their primary assets, which is why they have anonymized their data set. Hence, I didn’t spend much time on de-anonymization, since I considered it unethical.

However, for those who are interested in how to start de-anonymizing a data set, here are example de-anonymization notebooks shared by Gregory Calvez during the competition:

The author of the notebook explains his mindset throughout the notebook in clear and concise comments.

Cross-validation know-how

We have been given 2 years of time-series data.

To avoid information leakage, we couldn’t randomly split the data into cross-validation splits.

The main cross-validation methodology used in the competition was referred to as “Purged Group Time Series Split”. Here is how we split the data:

Credits to https://eng.uber.com/omphalos/

The intuition is that we iteratively expand the training data size over the entire history of a time series and repeatedly test against a forecasting window, without dropping older data points (Learn more about the method here).

Here is my implementation of the “Purged Group Time Series Split”:

import pandas as pd
import numpy as np

def custom_cv_split(df, date_column_name, n_splits=4):
    # Split the unique dates into n_splits consecutive chunks.
    # Assumes df is sorted by date, since unique() keeps the order of appearance.
    date_splits = np.array_split(df[date_column_name].unique(), n_splits)
    train_ixs = []
    test_ixs = []
    # Expanding training window: each fold adds one more chunk to the previous fold
    for split in date_splits[:-1]:
        if len(train_ixs) > 0:
            curr_tr_ix = train_ixs[-1].copy()
            curr_tr_ix.extend(df[df[date_column_name].isin(split)].index.values.tolist())
        else:
            curr_tr_ix = df[df[date_column_name].isin(split)].index.values.tolist()
        train_ixs.append(curr_tr_ix)
    # Each fold is tested on the chunk that directly follows its training window
    for split in date_splits[1:]:
        curr_ts_ix = df[df[date_column_name].isin(split)].index.values.tolist()
        test_ixs.append(curr_ts_ix)

    # Print fold sizes and the first/last indices as a sanity check
    for i in train_ixs:
        print("Training size: ", len(i), i[:10], i[-10:])
    print()
    for i in test_ixs:
        print("Test size: ", len(i), i[:10], i[-10:])
    return list(zip(train_ixs, test_ixs))

If you face any bugs, have fun debugging it. I am sure it will help you understand the code!
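
One thing worth noting: the function above expands the training window but does not drop any dates between the training and test folds. A purged variant leaves a gap there, so that samples close to the test window cannot leak into training. Below is a minimal sketch of that idea on a plain list of dates. The function name and the “gap” parameter are hypothetical, and this is not the exact splitter used in the competition notebooks.

```python
def purged_expanding_splits(dates, n_splits=4, gap=1):
    """Return (train_dates, test_dates) folds with `gap` dates dropped before each test fold.

    `dates` must be a sorted list of unique dates; `gap` is a hypothetical purge size.
    """
    chunk = len(dates) // n_splits
    folds = []
    for k in range(1, n_splits):
        test_start = k * chunk
        # The last fold absorbs any leftover dates.
        test_end = len(dates) if k == n_splits - 1 else (k + 1) * chunk
        train = dates[: max(test_start - gap, 0)]  # stop `gap` dates before the test fold
        test = dates[test_start:test_end]
        folds.append((train, test))
    return folds


dates = list(range(12))  # stand-in for 12 sorted trading days
for train, test in purged_expanding_splits(dates, n_splits=4, gap=1):
    print(train, test)
```

With 12 dates, 4 splits, and a gap of 1, the first fold trains on days 0-1 and tests on days 3-5, so day 2 is purged.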

Neural network optimization with Keras-tuner

The keras-tuner package has made it easy and intuitive to tune neural network hyperparameters! Its documentation is also quite simple.

However, when it comes to using custom cross-validation sets, we have to apply a small trick by creating and modifying the keras-tuner “kt.engine.tuner.Tuner” class:

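import numpy as np
import kerastuner as kt  # package name at the time of writing; later releases renamed it to keras_tuner

class CVTuner(kt.engine.tuner.Tuner):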
    def run_trial(self, trial, X, y, splits, batch_size=32, epochs=1,callbacks=None):
        val_losses = []
        for train_indices, test_indices in splits:
            X_train, X_test = [x[train_indices] for x in X], [x[test_indices] for x in X]
            y_train, y_test = [a[train_indices] for a in y], [a[test_indices] for a in y]
            if len(X_train) < 2:
                X_train = X_train[0]
                X_test = X_test[0]
            if len(y_train) < 2:
                y_train = y_train[0]
                y_test = y_test[0]
            
            model = self.hypermodel.build(trial.hyperparameters)
            hist = model.fit(X_train,y_train,
                      validation_data=(X_test,y_test),
                      epochs=epochs,
                        batch_size=batch_size,
                      callbacks=callbacks)
            
            # Keep the maximum value of every tracked metric for this fold.
            # np.max suits metrics such as AUC; use np.min for loss-like metrics.
            val_losses.append([np.max(hist.history[k]) for k in hist.history])
        val_losses = np.asarray(val_losses)
        self.oracle.update_trial(trial.trial_id, {k:np.mean(val_losses[:,i]) for i,k in enumerate(hist.history.keys())})
        self.save_model(trial.trial_id, model)

Make sure you understand each line, so that you can customize it later for your own use case.

As a next step, we can directly use our “CVTuner” class as:

tuner = CVTuner(
    hypermodel=model_fn, #keras model definition function
    oracle=kt.oracles.BayesianOptimization(
        objective= kt.Objective('val_loss', direction='min'),
        num_initial_points=4,
        max_trials=10,
        seed=SEED
        )
)
        
tuner.search(
    (X,), 
    (X,y), 
    splits=splits, #here we pass the CV folds
    batch_size=4096,
    epochs=100,
    callbacks=[EarlyStopping('val_loss',patience=5)]
)

And that is it!

With the help of the keras-tuner package, we can set up the hyperparameter tuning in only a few lines of code.

Keras fast inference know-how

To simulate a real-world High-Frequency-Trading (HFT) prediction scenario, the organizers have set a time limit of ~60 predictions/second.

This made heavy feature engineering almost impossible, leading us to focus on methods that improve the prediction/second rate.

One of the interesting methods was to make our Keras NN model inference time ~3-4 times faster by making our model “LiteModel”:

import numpy as np
import tensorflow as tf

class LiteModel:
    
    @classmethod
    def from_file(cls, model_path):
        return LiteModel(tf.lite.Interpreter(model_path=model_path))
    
    @classmethod
    def from_keras_model(cls, kmodel):
        converter = tf.lite.TFLiteConverter.from_keras_model(kmodel)
        tflite_model = converter.convert()
        return LiteModel(tf.lite.Interpreter(model_content=tflite_model))
    
    def __init__(self, interpreter):
        self.interpreter = interpreter
        self.interpreter.allocate_tensors()
        input_det = self.interpreter.get_input_details()[0]
        output_det = self.interpreter.get_output_details()[0]
        self.input_index = input_det["index"]
        self.output_index = output_det["index"]
        self.input_shape = input_det["shape"]
        self.output_shape = output_det["shape"]
        self.input_dtype = input_det["dtype"]
        self.output_dtype = output_det["dtype"]
        
    def predict(self, inp):
        inp = inp.astype(self.input_dtype)
        count = inp.shape[0]
        out = np.zeros((count, self.output_shape[1]), dtype=self.output_dtype)
        for i in range(count):
            self.interpreter.set_tensor(self.input_index, inp[i:i+1])
            self.interpreter.invoke()
            out[i] = self.interpreter.get_tensor(self.output_index)[0]
        return out
    
    def predict_single(self, inp):
        """ Like predict(), but only for a single record. The input data can be a Python list. """
        inp = np.array([inp], dtype=self.input_dtype)
        self.interpreter.set_tensor(self.input_index, inp)
        self.interpreter.invoke()
        out = self.interpreter.get_tensor(self.output_index)
        return out[0]

And finally, transform our model using the “LiteModel” class:

model = LiteModel.from_keras_model(model)
model.predict(X_test)

By doing this simple step, we were able to increase our predictions/second roughly 3-4 times!

However, one thing to note is that the TFLite conversion optimizes the model, so the predictions of the converted model might differ slightly from those of the original.

Despite this drawback, I did not see a large drop in performance in my experiments, so I used the TFLite conversion without worrying too much.

Conclusion

To conclude, I have learned many valuable techniques by participating in the “Jane Street Market Prediction” competition.

The winners of the competition will be announced after 6 months. Until then, our models are tested against real-time market data.

Ultimately, whether you win a medal or not, participating in Kaggle competitions is surely one of the best ways to acquire up-to-date State Of The Art (SOTA) ML modeling knowledge!

Therefore, for anyone who is interested in Data Science, Machine Learning or AI in general, I highly encourage you to give Kaggle a try.

Thank you for reading and happy Kaggling!

Not so simple classification.

Sat, 13 Feb 2021 15:00:00 GMT

Resources on the internet tend to treat binary classification as a relatively straightforward problem.

I have had the opportunity to work on a Proof-Of-Concept (POC) project, where we had to predict if a user will visit the campaign website given a set of features about the user.

My previous understanding about binary classification has changed dramatically after the experiments. Hopefully, your perspective about binary classification will change too.

After running a variety of experiments on our data set, we found several practical techniques that might be useful for the reader.

Specifically, we have found the answer to 3 critical questions:

How to train a model when only 1% of the samples are positive samples?
How to choose the decision threshold of a classification model?
How to evaluate a model, so that the client can easily understand the capabilities of the model?

1. How to train a model when only 1% of the samples are positive samples?

Before diving deep, let’s train a classifier and calculate our baseline scores!

First, let’s create a simple data set:

import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, 
                           n_features=30, 
                           n_redundant=0,
                           n_clusters_per_class=2, 
                           weights=[0.98], 
                           flip_y=0, 
                           random_state=12345)

By setting weights to 0.98, we assign roughly 98% of the samples to class 0, which creates an imbalanced data set.

For ease of use, let’s convert the arrays into a DataFrame:

cols = ["feature_"+str(i) for i in range(1, X.shape[1]+1)]
df = pd.DataFrame.from_records(X, columns=cols)
df["Class"] = y

Let’s split the data into train/test sets to evaluate our models on a common test set:

pos_samples = df[df["Class"]==1].sample(frac=1) #extracting positive samples
neg_samples = df[df["Class"]==0].sample(frac=1) #extracting negative samples

# let's set aside 500 positive and 500 negative samples
test = pd.concat((pos_samples[:500], neg_samples[:500]), axis=0).sample(frac=1).reset_index(drop=True) #combine, shuffle, reset indices

# let's use the rest for training
train = pd.concat((pos_samples[500:], neg_samples[500:]), axis=0).sample(frac=1).reset_index(drop=True) #combine, shuffle, reset indices

Now let’s train a model using the training set and evaluate on the test set (for simplicity, I am going to use sklearn’s “RandomForestClassifier”):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=1000, #the more the better, but slower
                               random_state=12345, #lucky number
                               class_weight="balanced",
                               verbose=2,
                               n_jobs=-1).fit(train.values[:,:-1], train["Class"])

Make sure the “class_weight” parameter is set to “balanced”, since we are dealing with an imbalanced classification problem!

Since we are working with imbalanced data, let’s use sklearn’s “balanced_accuracy_score”, “classification_report” and “plot_confusion_matrix” functionalities to evaluate our model:

from sklearn.metrics import classification_report, balanced_accuracy_score, plot_confusion_matrix

preds = model.predict(test.values[:,:-1])
print(classification_report(test["Class"], preds))
print("Accuracy: {}%".format(int(balanced_accuracy_score(test["Class"], preds)*100)))

output:

              precision    recall  f1-score   support

           0       0.74      1.00      0.85       500
           1       1.00      0.64      0.78       500

    accuracy                           0.82      1000
   macro avg       0.87      0.82      0.82      1000
weighted avg       0.87      0.82      0.82      1000

Accuracy: 82%

We can predict our target with 82% accuracy, not bad!
Let’s take a look at the confusion matrix:

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 15})

fig, ax = plt.subplots(figsize=(5, 5))
_=plot_confusion_matrix(model, test.values[:,:-1], test["Class"], values_format = '.0f', cmap=plt.cm.Blues, ax=ax)

output:

The baseline model seems to incorrectly predict positive samples as negative samples…Let’s try to improve our results!

Bagging

What is Bagging?
Bagging (bootstrap aggregating) is an ensemble learning method: several models are trained on different samples of the training data and their predictions are combined. In the variant used here, the training data is split into several subsets and a model is trained on each subset. After training, each model generates a prediction for a sample, and all predictions are averaged to produce the final prediction.

How to choose the number of splits?
There is no set value for the number of splits. However, based on how many positive samples there are, I usually choose 5-10 subsets. (Split as long as there is improvement)

pos_samples = train[train["Class"]==1].sample(frac=1)
neg_samples = train[train["Class"]==0].sample(frac=1)

#lets split into 5 bags
train_1 = pd.concat((pos_samples[:300], neg_samples[:3000]), axis=0)
train_2 = pd.concat((pos_samples[300:600], neg_samples[3000:6000]), axis=0)
train_3 = pd.concat((pos_samples[600:900], neg_samples[6000:9000]), axis=0)
train_4 = pd.concat((pos_samples[900:1200], neg_samples[9000:12000]), axis=0)
train_5 = pd.concat((pos_samples[1200:], neg_samples[12000:15000]), axis=0)

Train 5 models for 5 splits:

bag_1 = RandomForestClassifier(n_estimators=1000, random_state=12345, class_weight="balanced", n_jobs=-1).fit(train_1.values[:,:-1], train_1["Class"])
bag_2 = RandomForestClassifier(n_estimators=1000, random_state=12345, class_weight="balanced", n_jobs=-1).fit(train_2.values[:,:-1], train_2["Class"])
bag_3 = RandomForestClassifier(n_estimators=1000, random_state=12345, class_weight="balanced", n_jobs=-1).fit(train_3.values[:,:-1], train_3["Class"])
bag_4 = RandomForestClassifier(n_estimators=1000, random_state=12345, class_weight="balanced", n_jobs=-1).fit(train_4.values[:,:-1], train_4["Class"])
bag_5 = RandomForestClassifier(n_estimators=1000, random_state=12345, class_weight="balanced", n_jobs=-1).fit(train_5.values[:,:-1], train_5["Class"])

How to combine the predictions of the models?
We can use the “predict_proba” function of our “RandomForestClassifier” model to get the probabilities of each sample belonging to a specific class!

probs_1 = bag_1.predict_proba(test.values[:,:-1])[:,1]
probs_2 = bag_2.predict_proba(test.values[:,:-1])[:,1]
probs_3 = bag_3.predict_proba(test.values[:,:-1])[:,1]
probs_4 = bag_4.predict_proba(test.values[:,:-1])[:,1]
probs_5 = bag_5.predict_proba(test.values[:,:-1])[:,1]

Let’s evaluate our models:

probs = (probs_1+probs_2+probs_3+probs_4+probs_5)/5
preds = [1 if prob >= 0.5 else 0 for prob in probs]
print(classification_report(test["Class"], preds))
print("Accuracy: {}%".format(int(balanced_accuracy_score(test["Class"], preds)*100)))

output:

              precision    recall  f1-score   support

           0       0.76      1.00      0.87       500
           1       1.00      0.69      0.82       500

    accuracy                           0.85      1000
   macro avg       0.88      0.85      0.84      1000
weighted avg       0.88      0.85      0.84      1000

Accuracy: 84%

2% increase in accuracy…

Let’s take a look at the confusion matrix:

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(test["Class"], preds)
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g', cmap=plt.cm.Blues)

# labels, title and ticks
ax.set_xlabel('Predicted labels');ax.set_ylabel('True labels')
ax.xaxis.set_ticklabels(['0', '1']); ax.yaxis.set_ticklabels(['0', '1']) # rows of cm follow the label order 0, 1
_=plt.tight_layout()

output:

Some of the predictions have improved…But false negatives are still high!

Now, you might be wondering if Bagging is worth it or not.
Here comes the interesting part…

2. How to choose the decision threshold of a model?

The advantage of the Bagging method is observed when we choose optimal decision thresholds for our classifiers.

But what is a decision threshold?
For example, in a binary classification problem with class labels 0 and 1, with predicted probabilities and a decision threshold of 0.5, the predicted probabilities less than the threshold of 0.5 are assigned to class 0 and values greater than or equal to 0.5 are assigned to class 1.

Prediction < 0.5 = Class 0
Prediction >= 0.5 = Class 1

Why do we need to optimize the decision thresholds?
https://stats.stackexchange.com/questions/312119/reduce-classification-probability-threshold

Okay let’s start!

Baseline model with optimized decision thresholds:

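One of the easy ways to optimize the decision threshold is to simply iterate over a grid of candidate thresholds. In a real project, this search should run on a held-out validation set. Here, the test set is reused to keep the example short: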

dts = [i/100 for i in range(10, 100, 5)]
probs = model.predict_proba(test.values[:,:-1])[:,1] #probabilities do not depend on the threshold, so compute them once
accs = []
for dt in dts:
    preds = [1 if prob >= dt else 0 for prob in probs]
    acc = balanced_accuracy_score(test["Class"], preds)
    accs.append(acc)

Let’s plot the accuracies for each decision threshold:

fig=plt.figure(figsize=(12, 5))
_=plt.plot(accs)
_=plt.xticks([i for i in range(len(dts))], dts)
_=plt.grid()
_=plt.tight_layout()
_=plt.xlabel("Decision thresholds")
_=plt.ylabel("Accuracies")

output:

We can see that a decision threshold of 0.1 (the smallest value we tried) yields the best accuracy!

Let’s evaluate:

probs = model.predict_proba(test.values[:,:-1])[:,1]
preds = [1 if prob >= 0.1 else 0 for prob in probs]
print(classification_report(test["Class"], preds))
print("Accuracy: {}%".format(int(balanced_accuracy_score(test["Class"], preds)*100)))

output:

              precision    recall  f1-score   support

           0       0.83      0.99      0.91       500
           1       0.99      0.80      0.89       500

    accuracy                           0.90      1000
   macro avg       0.91      0.90      0.90      1000
weighted avg       0.91      0.90      0.90      1000

Accuracy: 89%

Baseline model’s accuracy has improved from 82% to 89%! Great!
How about the confusion matrix?

cm = confusion_matrix(test["Class"], preds)
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g', cmap=plt.cm.Blues)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.xaxis.set_ticklabels(['0', '1']); ax.yaxis.set_ticklabels(['0', '1']) # rows of cm follow the label order 0, 1
_=plt.tight_layout()

output:

Looks like we are getting somewhere. False negatives have decreased!

Let’s try the same optimization for the Bagging method!

Bagged models with optimized decision thresholds:

dts = [i/100 for i in range(10, 100, 5)]
#the bagged probabilities do not depend on the threshold, so compute them once
probs_1 = bag_1.predict_proba(test.values[:,:-1])[:,1]
probs_2 = bag_2.predict_proba(test.values[:,:-1])[:,1]
probs_3 = bag_3.predict_proba(test.values[:,:-1])[:,1]
probs_4 = bag_4.predict_proba(test.values[:,:-1])[:,1]
probs_5 = bag_5.predict_proba(test.values[:,:-1])[:,1]
probs = (probs_1+probs_2+probs_3+probs_4+probs_5)/5
accs = []
for dt in dts:
    preds = [1 if prob >= dt else 0 for prob in probs]
    acc = balanced_accuracy_score(test["Class"], preds)
    accs.append(acc)

Let’s plot the results:

fig=plt.figure(figsize=(12, 5))
_=plt.plot(accs)
_=plt.xticks([i for i in range(len(dts))], dts)
_=plt.grid()
_=plt.tight_layout()
_=plt.xlabel("Decision thresholds")
_=plt.ylabel("Accuracies")

output:

Similar to the baseline model, 0.1 seems to be the optimal decision threshold for maximizing accuracy!

Let’s evaluate:

probs_1 = bag_1.predict_proba(test.values[:,:-1])[:,1]
probs_2 = bag_2.predict_proba(test.values[:,:-1])[:,1]
probs_3 = bag_3.predict_proba(test.values[:,:-1])[:,1]
probs_4 = bag_4.predict_proba(test.values[:,:-1])[:,1]
probs_5 = bag_5.predict_proba(test.values[:,:-1])[:,1]
probs = (probs_1+probs_2+probs_3+probs_4+probs_5)/5
preds = [1 if prob >= 0.1 else 0 for prob in probs]
print(classification_report(test["Class"], preds))
print("Accuracy: {}%".format(int(balanced_accuracy_score(test["Class"], preds)*100)))

output:

              precision    recall  f1-score   support

           0       0.88      0.96      0.92       500
           1       0.96      0.86      0.91       500

    accuracy                           0.91      1000
   macro avg       0.92      0.91      0.91      1000
weighted avg       0.92      0.91      0.91      1000

Accuracy: 91%

We have reached 91% accuracy using Bagging!
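
The averaging-and-thresholding step is repeated in several places in this post, so it can be wrapped in a small helper. This is only a sketch: `ensemble_predict` and `ConstantModel` are hypothetical names, and `ConstantModel` is a stand-in for a fitted classifier so that the example runs without the bagged models. The helper only assumes that each model exposes `predict_proba` with the positive class in column 1.

```python
class ConstantModel:
    """Hypothetical stand-in for a fitted classifier with fixed probabilities."""

    def __init__(self, positive_probs):
        self.positive_probs = positive_probs

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] row per input sample
        return [[1 - p, p] for p in self.positive_probs[:len(X)]]


def ensemble_predict(models, X, threshold):
    """Average the positive-class probabilities of all models, then threshold."""
    per_model = [[row[1] for row in model.predict_proba(X)] for model in models]
    avg_probs = [sum(column) / len(models) for column in zip(*per_model)]
    preds = [1 if prob >= threshold else 0 for prob in avg_probs]
    return avg_probs, preds


# Three toy "models" scoring three samples (made-up numbers)
models = [
    ConstantModel([0.05, 0.20, 0.90]),
    ConstantModel([0.05, 0.10, 0.70]),
    ConstantModel([0.05, 0.30, 0.80]),
]
X = [[0], [0], [0]]

_, preds_low = ensemble_predict(models, X, threshold=0.1)
_, preds_default = ensemble_predict(models, X, threshold=0.5)
print(preds_low, preds_default)
```

In the toy data, the second sample has an averaged probability of about 0.2, so it is classified as positive only with the lower threshold. This is the effect the threshold optimization relies on.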

How about the confusion matrix?

cm = confusion_matrix(test["Class"], preds)
ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax, fmt='g', cmap=plt.cm.Blues)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.xaxis.set_ticklabels(['0', '1']); ax.yaxis.set_ticklabels(['0', '1'])  # heatmap rows follow confusion_matrix order: true 0 first, then true 1
_=plt.tight_layout()

output:

We have managed to improve our model’s false negative predictions quite significantly!
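
To put a number on this, the false negatives can be counted directly. The sketch below uses toy labels (not data from this post) and two hypothetical helpers. With scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same four counts in the order tn, fp, fn, tp for binary 0/1 labels. The second helper backs out an approximate false negative count from the rounded class-1 recall values in the two classification reports above (0.80 for the baseline, 0.86 for Bagging, with a support of 500).

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def implied_false_negatives(recall, support):
    """Approximate false negatives implied by a (rounded) recall value."""
    return round(support * (1 - recall))


# Toy labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("False negatives in the toy example:", fn)

# Approximate counts implied by the class-1 recalls reported above
baseline_fn = implied_false_negatives(0.80, 500)
bagging_fn = implied_false_negatives(0.86, 500)
print("Baseline: ~{} false negatives, Bagging: ~{}".format(baseline_fn, bagging_fn))
```

Because the recalls in the reports are rounded to two decimals, the implied counts are approximate.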

Before, it was difficult to judge the benefits of the Bagging method.
However, once we performed a simple decision threshold optimization routine, we were able to see the advantage of using the Bagging sampling method!

There is still room for improvement, so please try to improve the accuracy further.

Note: If you are interested in optimizing the decision threshold of a multi-class classifier, please let me know! I have a working solution.

3. How to evaluate a model, so that the client can easily understand the capabilities of the model?

Last, but not least, let’s talk about the interpretability of our results.

It is common for our clients to not have prior knowledge of ML evaluation methods. However, they would like to know if our models are robust or not.
How do we solve this problem?

For binary classification problems, here is a solution:
In the production environment, our classifier will most likely see data whose class balance differs from that of our test set. Therefore, we can evaluate the capability of our models by simulating different input class distributions!

Example:

from collections import Counter
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Split the test set by class once, outside the loop
neg = test[test["Class"]==0].reset_index(drop=True)
pos = test[test["Class"]==1].reset_index(drop=True)

rows = []
# Sweep the share of negatives from 100% down to 0% in steps of 10.
# Integer percentages avoid the rounding drift of repeatedly subtracting 0.1 from a float.
for neg_pct in range(100, -1, -10):
    pos_pct = 100 - neg_pct
    neg_sample = neg[:neg.shape[0]*neg_pct//100]
    pos_sample = pos[:neg.shape[0]*pos_pct//100]
    combined = pd.concat((neg_sample, pos_sample), axis=0)
    probs_1 = bag_1.predict_proba(combined.values[:,:-1])[:,1]
    probs_2 = bag_2.predict_proba(combined.values[:,:-1])[:,1]
    probs_3 = bag_3.predict_proba(combined.values[:,:-1])[:,1]
    probs_4 = bag_4.predict_proba(combined.values[:,:-1])[:,1]
    probs_5 = bag_5.predict_proba(combined.values[:,:-1])[:,1]
    probs = (probs_1+probs_2+probs_3+probs_4+probs_5)/5
    preds = [1 if prob >= 0.1 else 0 for prob in probs]
    acc = accuracy_score(combined["Class"], preds)
    # Share of negative / positive predictions, in percent
    neg_preds = (Counter(preds)[0]/len(preds))*100
    pos_preds = (Counter(preds)[1]/len(preds))*100
    rows.append([neg_pct, pos_pct, neg_preds, pos_preds, acc])

df = pd.DataFrame.from_records(rows, columns=["Actual - distribution", "Actual + distribution", "Predicted - distribution", "Predicted + distribution", "Accuracy"])

Let’s take a look at the results:

output:

If we take a look at the first 2 columns in the table, we can see that the model is being evaluated on different input distributions (values are in percentages %).

Moreover, our model is able to accurately predict the positive and negative distributions (3rd and 4th columns, values are in percentages %)!

If we take a look at the “Accuracy” column and analyze from top to bottom, we can see that our model has a bias toward predicting negative samples. However, the overall accuracy seems to be quite high!

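Isn’t the above table too complex?
As a simpler summary, we can split the test data into 10 folds, compute the balanced accuracy on each fold, and report the mean together with an approximate 95% interval (mean ± 2 standard deviations across the folds)!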

X_test = test[cols]
y_test = test["Class"]

from sklearn.model_selection import KFold

cv = KFold(n_splits=10, random_state=12345, shuffle=True)
accs = []
for train_index, test_index in cv.split(X_test, y_test):
    xtest, ytest = X_test.iloc[test_index], y_test.iloc[test_index]
    probs_1 = bag_1.predict_proba(xtest)[:,1]
    probs_2 = bag_2.predict_proba(xtest)[:,1]
    probs_3 = bag_3.predict_proba(xtest)[:,1]
    probs_4 = bag_4.predict_proba(xtest)[:,1]
    probs_5 = bag_5.predict_proba(xtest)[:,1]
    probs = (probs_1+probs_2+probs_3+probs_4+probs_5)/5
    preds = [1 if prob >= 0.15 else 0 for prob in probs]
    acc = balanced_accuracy_score(ytest, preds)
    accs.append(acc)
    
accs_ = np.array(accs) 
print("Accuracy: %0.2f (+/- %0.2f)" % (accs_.mean(), accs_.std() * 2))

Accuracy: 0.91 (+/- 0.05)

Using this single value for accuracy, the client can easily understand the overall capability of our model!
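
One caveat on how to read that number: the `+/-` term printed above is two standard deviations of the per-fold scores, so it describes how much a single fold can vary. An interval for the mean itself would use the standard error instead and would be narrower. Here is a small sketch of both, using only the standard library. The helper name `summarize_fold_scores` is hypothetical, and the fold scores are made-up placeholders, not the values behind the result above.

```python
import math
import statistics


def summarize_fold_scores(scores):
    """Return (mean, fold_spread, mean_half_width) for per-fold scores.

    fold_spread     : 2 * population std, the same quantity as accs_.std() * 2
    mean_half_width : 2 * standard error of the mean (rough normal approximation)
    """
    mean = statistics.mean(scores)
    fold_spread = 2 * statistics.pstdev(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, fold_spread, 2 * standard_error


# Placeholder per-fold balanced accuracies (NOT results from this post)
fold_scores = [0.90, 0.93, 0.88, 0.92, 0.91, 0.89, 0.94, 0.90, 0.92, 0.91]

mean, fold_spread, mean_half_width = summarize_fold_scores(fold_scores)
print("Accuracy: %0.2f (+/- %0.2f across folds)" % (mean, fold_spread))
print("Accuracy: %0.2f (+/- %0.2f for the mean)" % (mean, mean_half_width))
```

With only 10 folds, the factor of 2 is a rough approximation, and a t-based multiplier would give a slightly wider interval for the mean. Which version to show a client depends on whether the question is "how good is the model on average?" or "how much can a single batch vary?".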

Conclusion:

In this post, I have tried to introduce an ensemble sampling method called “Bagging”.
In the beginning, it was difficult to assess the advantage of using the Bagging method. However, once we optimized our decision thresholds, we were able to see the benefits.
It is common for our clients to require interpretable evaluation metrics. Therefore, I have shared a simple evaluation method for explaining classification models.
