<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Applied Data Science &amp; ML</title>
<link>https://bilguunbatsaikhan.com/</link>
<atom:link href="https://bilguunbatsaikhan.com/index.xml" rel="self" type="application/rss+xml"/>
<description>Hi, my name is Billy. This is where I try to archive topics related to Applied Data Science &amp; ML.</description>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Sat, 11 Apr 2026 15:00:00 GMT</lastBuildDate>
<item>
  <title>Changes to my workflow</title>
  <link>https://bilguunbatsaikhan.com/posts/changes-to-my-workflow/</link>
  <description><![CDATA[ 




<p>It’s been a while. I’ve been training for a triathlon and between that, life changes, and the day job, writing blog posts fell off the priority list. But something has shifted enough in how I work that I wanted to document it before it becomes normal and I forget what it replaced.<br>
<br>
As I have mentioned in my previous posts, I build data pipelines, ML and causal inference models, and internal analytics tools. Currently, I’m working on a research paper and pushing toward a conference deadline. For the past several months, I’ve been using an internal coding assistant that lives in my terminal: it reads my files, runs shell commands, talks to cloud services, and maintains context across long sessions. Not autocomplete. Closer to a junior engineer who has read every file in your repo and takes direction well (conditional on the task), but will confidently ship untested code if you let it.<br>
The interesting thing isn’t the code generation. It’s that my job has quietly shifted from writing code to delegating work and evaluating output. That’s a different skill entirely, and most people using these tools haven’t realized the transition is happening to them.</p>
<section id="api-that-went-through-five-revisions" class="level3">
<h3 class="anchored" data-anchor-id="api-that-went-through-five-revisions">API That Went Through Five Revisions</h3>
<p>I was building an analysis engine for a serverless function that loads a hierarchical metric tree from a data warehouse (think: a top-level KPI decomposes into sub-metrics algebraically, each decomposing further across ~100 nodes), and propagates what-if changes through the tree. The agent wrote the first version in hours. It worked. Tests passed.<br>
<br>
What followed was five rounds of refactoring the same module. Each iteration, the agent rewrote the code, updated tests, redeployed. I kept pushing: “no, these columns aren’t tree nodes, they’re raw inputs to derived ratios. The naming should reflect that.” A flat lookup dict became a typed configuration with a dataclass. Each round got cleaner.<br>
<br>
I was doing design work, not typing work. The coding agent handled mechanical refactoring: renaming across files, updating assertions, regenerating query logic. I decided whether the abstraction made sense.<br>
<br>
The code review came back with a pointed question: “Do we still need this data-loading mode?” I initially argued yes. The reviewer was right, though: another tool already handled data inspection. Because each service should do one thing, we ripped the duplicated data-loading module out, rewrote the interface from four modes to two, updated three packages, revalidated all nodes, and merged. The agent did all the mechanical work; my job was to detect “smelly” work and push back when necessary.</p>
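<p>To make the “metric tree with what-if propagation” idea concrete, here is a minimal sketch. It is not the production engine: the <code>MetricNode</code> dataclass, the toy revenue tree, and every name in it are hypothetical, and the real tree is loaded from a warehouse rather than written inline.</p>

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class MetricNode:
    """One node of a hypothetical metric tree."""
    name: str
    children: List[str] = field(default_factory=list)
    # How the node is computed from its children's values; leaves have no formula.
    formula: Optional[Callable[..., float]] = None


def evaluate(tree: Dict[str, MetricNode], leaves: Dict[str, float], name: str) -> float:
    """Recursively compute a node's value from the leaf inputs."""
    node = tree[name]
    if not node.children:
        return leaves[name]
    child_values = [evaluate(tree, leaves, child) for child in node.children]
    return node.formula(*child_values)


def what_if(tree, leaves, root, changes):
    """Return (baseline, scenario) values of root after overriding some leaves."""
    scenario = {**leaves, **changes}
    return evaluate(tree, leaves, root), evaluate(tree, scenario, root)


# Toy tree: revenue = orders * average_order_value, orders = visits * conversion_rate
tree = {
    "revenue": MetricNode("revenue", ["orders", "average_order_value"], lambda o, a: o * a),
    "orders": MetricNode("orders", ["visits", "conversion_rate"], lambda v, c: v * c),
    "visits": MetricNode("visits"),
    "conversion_rate": MetricNode("conversion_rate"),
    "average_order_value": MetricNode("average_order_value"),
}
leaves = {"visits": 1000.0, "conversion_rate": 0.05, "average_order_value": 20.0}

baseline, scenario = what_if(tree, leaves, "revenue", {"conversion_rate": 0.06})
print(baseline, scenario)
```

<p>Keeping the formulas on the nodes is the point of the typed configuration: a what-if change only touches leaf inputs, and everything above is recomputed from the same definitions.</p>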
</section>
<section id="writing-math-proofs" class="level3">
<h3 class="anchored" data-anchor-id="writing-math-proofs">Writing Math Proofs</h3>
<p>This is where it got interesting. My paper extends a recent sensitivity analysis framework from linear models to double machine learning. The theory section was a 23-line stub. My collaborator said it needed to become the centerpiece. I had the agent download the original paper from arXiv, extract the proofs, and explain each theorem. The key finding was that the original proofs use linear algebra projections everywhere, and those don’t extend to nonparametric models. We couldn’t just cite them. We needed original proofs.<br>
<br>
After a comprehensive literature review, a three-tier proof structure emerged in one session. Proposition 1: the linear case reduces exactly to the original framework. Proposition 2: additive models, consistency via decomposition. Proposition 3: general nonlinear case via concentration inequalities. The agent wrote the LaTeX. I checked the math. Then I told it to run an adversarial audit on its own proofs.<br>
<br>
It found two critical errors. First, it used an inequality that requires independent random variables. Our sampling scheme produces negatively correlated variables. Wrong inequality. We switched to one that handles dependent sampling. Second, it claimed a variance function is always concave. I was skeptical. It produced a counterexample with two interacting binary variables where the marginal information gain <em>increases</em>, which is the opposite of concavity. The claim was false.</p>
<p>The agent wrote a proof, disproved part of its own proof with a concrete counterexample, then rewrote the proposition with an honest three-case treatment. I’ve worked with human collaborators who wouldn’t catch that. At this point it feels like if I throw as many tokens at a problem as time allows and give the agent clean, organized context, solving a great number of problems is just a matter of time (I’m not going to argue about whether it can find the Pareto-optimal solution).</p>
</section>
<section id="thousands-of-cloud-training-jobs" class="level3">
<h3 class="anchored" data-anchor-id="thousands-of-cloud-training-jobs">Thousands of Cloud Training Jobs</h3>
<p>Next, for the above-mentioned paper, we needed a massive simulation study with nearly 3,000 configurations across effect sizes, confounding strengths, and increasing sample sizes. Each runs as a cloud training job. The agent wrote the dispatch script.</p>
<p>The dispatch job died at job 194. Root cause: an O(n) bottleneck. For every new job, the script polled the status of all previously launched jobs, so each launch cost more than the last. API rate limits turned this into a 40-second check per iteration. Throughput collapsed. Credentials expired.</p>
<p>To fix this, the agent researched the cloud API’s behavior, found that job creation fails synchronously at the quota limit rather than queuing, and redesigned the script as fire-and-forget with retry-on-quota. The new version dispatched at maximum API rate, handled quota limits with backoff, and added resume flags for credential expiry. After this fix, all jobs ran.<br>
<br>
Then it collected results, identified which hyperparameter actually dominates model bias (not the one I expected: a 3x range from a single parameter), and generated the appendix analysis. Dispatch, collection, analysis in ONE day. That’s a week or more of deep-focus work without the coding agent.</p>
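<p>The fire-and-forget redesign can be sketched roughly as follows. This is an illustration under assumptions, not the actual dispatch script: <code>create_job</code> stands in for whatever the cloud SDK call is, <code>QuotaExceeded</code> is a hypothetical exception for the synchronous quota failure, and <code>already_done</code> plays the role of the resume flag.</p>

```python
import time


class QuotaExceeded(Exception):
    """Hypothetical error raised when job creation hits the quota limit."""


def dispatch_all(configs, create_job, already_done=(), base_delay=1.0,
                 max_delay=60.0, sleep=time.sleep):
    """Submit every config once, without polling the status of earlier jobs.

    create_job is a stand-in for the cloud SDK call; it is assumed to raise
    QuotaExceeded synchronously when the quota is full.
    """
    done = set(already_done)          # lets a rerun resume after credentials expire
    for index, config in enumerate(configs):
        if index in done:
            continue
        delay = base_delay
        while True:
            try:
                create_job(config)    # fire and forget: no status check of older jobs
                break
            except QuotaExceeded:
                sleep(delay)          # back off, then retry the same config
                delay = min(delay * 2, max_delay)
        done.add(index)
    return done


# Tiny dry run with a fake API that rejects the first two attempts.
submitted = []
failures = {"remaining": 2}


def fake_create_job(config):
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise QuotaExceeded()
    submitted.append(config)


waits = []
done = dispatch_all(["a", "b", "c"], fake_create_job, already_done={0}, sleep=waits.append)
print(submitted, waits, sorted(done))
```

<p>The work per job no longer depends on how many jobs were launched before it, which is what removes the bottleneck described above.</p>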
</section>
<section id="where-it-goes-wrong" class="level3">
<h3 class="anchored" data-anchor-id="where-it-goes-wrong">Where It Goes Wrong</h3>
<p>One time the agent submitted a code review for three packages without deploying or running end-to-end validation. I caught it and told the agent: “We must test the entire pipeline before sending the review. Actually make sure this never happens again.” To handle this, we added a pre-submission validation checklist as a permanent rule. The lesson is that agents, in their current state, optimize for task completion and will skip validation unless we explicitly require it. One solution that seems to work well is to set up “hooks” that run automatically after builds. This makes the validation deterministic, so you don’t rely on the agent to remember the routine, which becomes more of an issue as context grows.<br>
<br>
Another pattern worth mentioning: the paper draft had stale numbers from an old experiment. The agent’s adversarial audit caught it, and a key statistic changed meaningfully when we switched to proper cross-validation. The old numbers were wrong and had been sitting in the draft for days. The agent can catch errors, but only when you tell it to look. It doesn’t spontaneously doubt its own previous output, and in most cases, it will say that everything has been implemented and tested thoroughly. To create the adversarial audit, I worked with the agent to read and compile the literature on which steering instructions and methods tend to work well with coding agents. The result is an instruction document that I invoke manually whenever something smells fishy. Works pretty well so far.</p>
</section>
<section id="delegation-as-the-core-skill" class="level3">
<h3 class="anchored" data-anchor-id="delegation-as-the-core-skill">Delegation as the Core Skill</h3>
<p>Here’s what I think most people miss. Using these tools well is not about prompt engineering or knowing the right magic words. It’s about delegation and evaluation, which are the same skills you need to manage people, applied to a machine.<br>
<br>
I write instruction sets loaded by task phase: thinking, researching, implementing, reviewing. I run adversarial audits where the agent attacks its own work. I enforce checklists. I push back when the abstraction is wrong. It’s less like pair programming and more like being a tech lead for a very fast, very literal engineer who occasionally produces brilliant work and occasionally tries to ship without testing.<br>
<br>
The evaluation part is critical and underappreciated. When the agent writes a proof, I need to know enough math to verify it. When it refactors a data mapping, I need to understand the schema well enough to judge whether the new abstraction is better. When it dispatches cloud jobs, I need to understand the API limits to know if the retry logic makes sense. I think I now understand what people meant when they said “AI amplifies your existing expertise”. If you don’t have the expertise, it amplifies your ability to produce confident-looking garbage.<br>
<br>
I wrote proofs, shipped a production API, dispatched thousands of training jobs, and debugged infrastructure WITHIN the same week, all while training for a triathlon. A few months ago, that would have been a month of just the engineering work. The tools are crude, context windows degrade, and the agent drifts without anchoring. But for the kind of work where the hard part is the science and the design, and not the typing, it’s already a different game for me. And to think I’ve only started investing in learning agentic coding this January...</p>
<p>Cheers, see y’all next time!<br>
</p>


</section>

 ]]></description>
  <guid>https://bilguunbatsaikhan.com/posts/changes-to-my-workflow/</guid>
  <pubDate>Sat, 11 Apr 2026 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Going down a random rabbit hole: From XML Tags to $100M Weight Updates</title>
  <link>https://bilguunbatsaikhan.com/posts/the-claude-architecture-from-xml-tags-to-100m-weight-updates/</link>
  <description><![CDATA[ 




<p>I started with a small question: why does Anthropic insist on XML tags? It sounded like a formatting preference. The deeper I dug into the engineering, the more it looked like XML is less a style choice and more a statistical anchor the model’s parameters were trained to recognize. Here is the thread I pulled on.</p>
<section id="the-xml-attention-fence" class="level3">
<h3 class="anchored" data-anchor-id="the-xml-attention-fence">1. The XML “attention fence”</h3>
<p>The reason XML reads like a native interface to Claude comes down to how its self-attention heads were trained.</p>
<p>During pre-training, untrusted data gets wrapped in tags. That teaches the model to lower the weight it places on tokens inside something like <code>&lt;user_query&gt;</code> when it is predicting the next token for a system instruction. The tags end up acting as a structural fence, which is what keeps the model from mistaking your background data for a new command (sometimes called attention contamination).</p>
</section>
<section id="phase-1-supervised-fine-tuning-as-behavioral-cloning" class="level3">
<h3 class="anchored" data-anchor-id="phase-1-supervised-fine-tuning-as-behavioral-cloning">2. Phase 1: supervised fine-tuning as behavioral cloning</h3>
<p>This is the stage where the model learns the constitution, a rulebook of principles. It isn’t reasoning yet, it is mimicking.</p>
<p>The loop: the model generates a response, critiques it against the constitution, revises it, and then gets trained to clone that revised version. The mechanism is cross-entropy loss. For every token, the model outputs a probability vector <img src="https://latex.codecogs.com/png.latex?P"> over its full vocabulary (on the order of 100k tokens). You compare <img src="https://latex.codecogs.com/png.latex?P"> against a one-hot ground-truth vector <img src="https://latex.codecogs.com/png.latex?Y">, which is all zeros except a <img src="https://latex.codecogs.com/png.latex?1"> at the correct token’s index:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BLoss%7D%20=%20-%5Csum_i%20Y_i%20%5Clog(P_i)%0A"></p>
<p>If the model put 40% on “Hello” when the truth was <code>&lt;thinking&gt;</code> at 100%, backpropagation adjusts weights across the network to make <code>&lt;thinking&gt;</code> the statistically favored choice next time.</p>
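<p>A toy version of that loss and one corrective step, as a sketch only. It uses a three-token vocabulary instead of ~100k, the scores are made up, and it nudges the logits directly rather than backpropagating through a network. <code>thinking_tag</code> is a placeholder name for the opening thinking tag.</p>

```python
import math


def softmax(logits):
    """Turn raw scores into a probability vector P."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Loss = -sum_i Y_i * log(P_i); only the correct token's term survives."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y == 1)


vocab = ["Hello", "thinking_tag", "The"]   # thinking_tag stands in for the opening thinking tag
one_hot = [0, 1, 0]                        # ground truth Y: the tag is the correct token
logits = [1.0, 0.5, 0.2]                   # made-up scores that favour "Hello"

loss_before = cross_entropy(softmax(logits), one_hot)

# One gradient step on the logits. For softmax followed by cross-entropy,
# the gradient with respect to the logits is simply P - Y.
learning_rate = 1.0
probs = softmax(logits)
logits = [z - learning_rate * (p - y) for z, p, y in zip(logits, probs, one_hot)]

probs_after = softmax(logits)
loss_after = cross_entropy(probs_after, one_hot)
print(round(loss_before, 3), round(loss_after, 3))
```

<p>After the step, the correct token has the highest probability and the loss is lower, which is the whole mechanism in miniature.</p>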
</section>
<section id="phase-2-rlaif-and-ppo-the-optimization-stage" class="level3">
<h3 class="anchored" data-anchor-id="phase-2-rlaif-and-ppo-the-optimization-stage">3. Phase 2: RLAIF and PPO, the optimization stage</h3>
<p>Once the model knows <em>how</em> to speak, this phase teaches it judgment, using reinforcement learning from AI feedback. The PPO objective looks like:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctext%7BObjective%7D%20=%20%5Ctext%7BReward%7D%20-%20%5Cbeta%20%5Ccdot%20%5Cmathrm%7BKL%7D%5C!%5Cleft(%5Cpi_%7B%5Ctext%7Bnew%7D%7D%20%5C,%5C%7C%5C,%20%5Cpi_%7B%5Ctext%7Bold%7D%7D%5Cright)%0A"></p>
<p>Three pieces worth pulling apart:</p>
<ul>
<li><strong>The reward.</strong> An AI judge model scores responses. Follow the constitution and use structure correctly, and the response earns a positive scalar; the logic that produced it gets reinforced.</li>
<li><strong>The KL penalty.</strong> This is the part that matters. If the model tries to game the reward by drifting into gibberish the judge happens to like, the KL divergence term spikes and pulls it back toward the stable language model from Phase 1.</li>
<li><strong>PPO clipping.</strong> Proximal Policy Optimization clips the probability ratio between the new and old policy, which limits how far the policy can move per update and prevents model collapse from a few outlier rewards.</li>
</ul>
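<p>The KL penalty and the clipping can be illustrated on toy numbers. This is a sketch under assumptions: the distributions, rewards, <code>beta</code>, and <code>epsilon</code> are made up, and a real setup computes these quantities over sampled token sequences, not a three-token distribution.</p>

```python
import math


def kl_divergence(p_new, p_old):
    """KL(p_new || p_old) for two discrete distributions over the same tokens."""
    return sum(pn * math.log(pn / po) for pn, po in zip(p_new, p_old) if pn > 0)


def penalized_objective(reward, p_new, p_old, beta):
    """Objective = Reward - beta * KL(pi_new || pi_old)."""
    return reward - beta * kl_divergence(p_new, p_old)


def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO's clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)


p_old = [0.5, 0.3, 0.2]          # reference policy from Phase 1 (toy next-token distribution)
p_close = [0.55, 0.28, 0.17]     # small, sensible shift
p_drifted = [0.01, 0.01, 0.98]   # collapsed onto one token the judge happens to like

beta = 2.0   # hypothetical penalty weight
# The drifted policy earns a higher raw reward but pays a much larger KL penalty.
score_close = penalized_objective(reward=1.0, p_new=p_close, p_old=p_old, beta=beta)
score_drifted = penalized_objective(reward=1.5, p_new=p_drifted, p_old=p_old, beta=beta)
print(score_close > score_drifted)

# A ratio of 3.0 would triple the update; clipping caps its contribution at 1 + epsilon.
print(clipped_surrogate(ratio=3.0, advantage=1.0))
```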
</section>
<section id="scaling-the-infrastructure-bill" class="level3">
<h3 class="anchored" data-anchor-id="scaling-the-infrastructure-bill">4. Scaling: the infrastructure bill</h3>
<p>Classic RLHF is bottlenecked by human reading speed. RLAIF removes the human from the inner loop, so the feedback runs at GPU-cluster speed instead. The cost isn’t labor at that point, it is the electricity and compute to run these self-improvement loops at scale.</p>
<hr>
<p><strong>The takeaway.</strong> Using XML tags isn’t about being tidy. It is about lining your prompt up with the statistical patterns the model’s weights were optimized to reward in the first place.</p>


</section>

 ]]></description>
  <guid>https://bilguunbatsaikhan.com/posts/the-claude-architecture-from-xml-tags-to-100m-weight-updates/</guid>
  <pubDate>Tue, 03 Mar 2026 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Estimating the Distribution of Omitted Variable Bias in Causal Inference</title>
  <link>https://bilguunbatsaikhan.com/posts/gemini-deep-research-summary-of-the-state-of-ovb/</link>
  <description><![CDATA[ 




<p>I went down a rabbit hole on omitted variable bias, specifically the question of not just <em>whether</em> a confounder biases your estimate, but how to put a <em>range</em> on how badly. These are my notes on the main families of methods people use, what each one buys you, and where each one breaks.</p>
<section id="what-ovb-actually-is" class="level3">
<h3 class="anchored" data-anchor-id="what-ovb-actually-is">What OVB actually is</h3>
<p>Omitted variable bias shows up when you leave out a variable that is correlated with both your treatment of interest and your outcome. The model has nowhere to put that variable’s influence, so it smears it onto the coefficients you did include, and your estimate of the causal effect drifts up or down.</p>
<p>The classic example: regressing salary on years of education while leaving out ability. If ability raises both how much schooling someone gets and how much they earn, the education coefficient absorbs part of ability’s effect, and you overstate the return to education.</p>
<p>Two conditions have to hold for the bias to exist:</p>
<ol type="1">
<li>the omitted variable genuinely affects the outcome, and</li>
<li>it is correlated with an included regressor.</li>
</ol>
<p>If either fails, omission is harmless. The direction of the bias follows the signs: same-sign correlations push the estimate up, opposite signs push it down. Reasoning through those signs without knowing the magnitude is what people call “signing the bias.”</p>
</section>
<section id="the-algebra" class="level3">
<h3 class="anchored" data-anchor-id="the-algebra">The algebra</h3>
<p>Suppose the true model is</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay%20=%20b%5C,x%20+%20c%5C,z%20+%20u,%0A"></p>
<p>but you omit <img src="https://latex.codecogs.com/png.latex?z"> and fit</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Ay%20=%20a%5C,x%20+%20v.%0A"></p>
<p>Then the estimated coefficient on <img src="https://latex.codecogs.com/png.latex?x"> has expectation</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5Ba%5D%20=%20b%20+%20c%5C,f,%0A"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?f"> is the coefficient from regressing the omitted <img src="https://latex.codecogs.com/png.latex?z"> on the included <img src="https://latex.codecogs.com/png.latex?x">. The bias term is <img src="https://latex.codecogs.com/png.latex?c%5C,f">: it is zero exactly when <img src="https://latex.codecogs.com/png.latex?z"> has no effect on <img src="https://latex.codecogs.com/png.latex?y"> (<img src="https://latex.codecogs.com/png.latex?c=0">) or when <img src="https://latex.codecogs.com/png.latex?x"> and <img src="https://latex.codecogs.com/png.latex?z"> are uncorrelated (<img src="https://latex.codecogs.com/png.latex?f=0">). Clean, but not directly usable, because <img src="https://latex.codecogs.com/png.latex?c"> and <img src="https://latex.codecogs.com/png.latex?f"> both involve the thing you never observed. That is the whole problem, and every method below is a different way around it.</p>
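<p>The formula is easy to check on synthetic data, where the omitted variable is actually available. A minimal simulation, with made-up coefficients (<code>b = 2.0</code>, <code>c = 1.5</code>, and a true slope of 0.8 for <code>z</code> on <code>x</code>):</p>

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


rng = random.Random(0)
b, c = 2.0, 1.5        # true coefficients on x and on the omitted z
n = 20000

x = [rng.gauss(0, 1) for _ in range(n)]
# z is correlated with x: regressing z on x gives a slope f close to 0.8
z = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [b * xi + c * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

a_hat = ols_slope(x, y)   # short regression that omits z
f_hat = ols_slope(x, z)   # auxiliary regression of z on x
print(round(a_hat, 2), round(b + c * f_hat, 2))
```

<p>The two printed numbers should nearly coincide and sit well above the true <code>b</code>: the short regression recovers <code>b + c·f</code>, not <code>b</code>.</p>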
</section>
<section id="five-ways-to-put-a-range-on-the-bias" class="level3">
<h3 class="anchored" data-anchor-id="five-ways-to-put-a-range-on-the-bias">Five ways to put a range on the bias</h3>
<p><strong>1. Sensitivity analysis.</strong> Instead of assuming no confounding, ask how strong a confounder would need to be to overturn your conclusion. The <em>robustness value</em> captures this: the minimum association (measured as partial <img src="https://latex.codecogs.com/png.latex?R%5E2">) an unobserved confounder must have with both treatment and outcome to drive the effect to zero. If a weak, implausible confounder is enough to flip the result, the finding is fragile; if only an implausibly strong one would do it, the finding is robust. Cinelli and Hazlett’s contour plots are the standard way to read this off visually, showing where the estimate stays significant across combinations of confounder strength.</p>
<p><strong>2. Bounding approaches.</strong> Rather than scan scenarios, fix an assumption about the <em>most</em> explanatory power any omitted variable could plausibly have, and derive hard upper and lower limits on the effect. The bounds are only as credible as that ceiling assumption. Set it too low and the true effect can fall outside your interval; set it too high and the bounds are so wide they say nothing. <em>Covariate benchmarking</em> anchors the assumption in data: argue that no unobserved confounder is stronger than, say, your strongest observed covariate, and use that as the empirical ceiling.</p>
<p><strong>3. Simulation.</strong> Build synthetic datasets where you control the omitted variable’s effect, then estimate the treatment effect while deliberately leaving it out, repeatedly. The spread of estimates is the empirical distribution of the bias. Useful for two things: seeing which conditions make OVB worst (stronger confounder-regressor correlation, larger effect on the outcome), and validating that a sensitivity method’s claimed bounds actually cover the truth.</p>
<p><strong>4. Bayesian methods.</strong> Put a prior on the unobserved confounder’s parameters (or directly on the bias), and the posterior on the causal effect carries that uncertainty through. You get a full distribution over plausible effects instead of a point estimate. The catch is the obvious one: if the posterior moves a lot when you change the prior, the data isn’t doing the work, your assumptions are.</p>
<p><strong>5. Machine learning.</strong> When relationships are non-linear or high-dimensional, flexible models estimate the nuisance functions better than a hand-specified linear model. DoubleML now ships OVB sensitivity analysis inside the double-machine-learning framework, and methods like BART give uncertainty estimates directly. The open problem is the usual one with flexible models: quantifying causal uncertainty rigorously, and explaining what the model is actually conditioning on.</p>
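<p>For method 1, the robustness value has a closed form that only needs the treatment’s t-statistic and the residual degrees of freedom. The sketch below uses the formula as I understand it from Cinelli and Hazlett (2020); the t-statistic and degrees of freedom are made up, so check against the paper or a maintained implementation before relying on it.</p>

```python
import math


def partial_f(t_statistic, dof):
    """Partial Cohen's f of the treatment with the outcome: |t| / sqrt(dof)."""
    return abs(t_statistic) / math.sqrt(dof)


def robustness_value(t_statistic, dof, q=1.0):
    """RV_q = 0.5 * (sqrt(f_q^4 + 4 f_q^2) - f_q^2), with f_q = q * partial f.

    q is the share of the estimate the confounder must explain away;
    q = 1 means driving the point estimate all the way to zero.
    """
    f_q = q * partial_f(t_statistic, dof)
    return 0.5 * (math.sqrt(f_q ** 4 + 4 * f_q ** 2) - f_q ** 2)


# Hypothetical regression output: t = 4.0 on the treatment, 400 residual degrees of freedom.
rv = robustness_value(t_statistic=4.0, dof=400)
print(round(rv, 3))
```

<p>Read the result as: a confounder would need a partial <em>R</em>² of at least this size with both the treatment and the outcome to erase the estimate. Whether that is plausible is the domain-knowledge question.</p>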
</section>
<section id="how-they-compare" class="level3">
<h3 class="anchored" data-anchor-id="how-they-compare">How they compare</h3>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Strength</th>
<th>Where it breaks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Analytical</strong></td>
<td>Gives the fundamental picture of how OVB arises.</td>
<td>Relies on quantities you can’t observe.</td>
</tr>
<tr class="even">
<td><strong>Sensitivity analysis</strong></td>
<td>Quantifies how much confounding it takes to flip the result.</td>
<td>Gives no single “corrected” estimate.</td>
</tr>
<tr class="odd">
<td><strong>Bounding</strong></td>
<td>Returns an actual range for the effect.</td>
<td>Only as good as the plausibility ceiling you assume.</td>
</tr>
<tr class="even">
<td><strong>Simulation</strong></td>
<td>Controlled, lets you validate other methods.</td>
<td>Conclusions are hostage to the data-generating process you chose.</td>
</tr>
<tr class="odd">
<td><strong>Bayesian</strong></td>
<td>Carries uncertainty through to a full posterior.</td>
<td>Sensitive to prior specification.</td>
</tr>
<tr class="even">
<td><strong>Machine learning</strong></td>
<td>Handles non-linear, high-dimensional confounding.</td>
<td>Uncertainty quantification for causal effects is still maturing.</td>
</tr>
</tbody>
</table>
</section>
<section id="what-i-took-away" class="level3">
<h3 class="anchored" data-anchor-id="what-i-took-away">What I took away</h3>
<p>The honest core of all of it: you cannot measure what you did not observe, so every method here trades the impossible question (“what is the bias?”) for a tractable one (“how strong would a confounder have to be, and is that plausible here?”). Sensitivity analysis and covariate benchmarking are the two I find most useful in practice, because they hand the judgment back to domain knowledge instead of hiding it inside a prior or a simulated data-generating process.</p>
</section>
<section id="references" class="level3">
<h3 class="anchored" data-anchor-id="references">References</h3>
<p>The starting points worth reading if you want the real treatment, not my notes:</p>
<ul>
<li>Cinelli &amp; Hazlett (2020), <em>Making Sense of Sensitivity: Extending Omitted Variable Bias</em>. The robustness value and contour plots. <a href="https://carloscinelli.com/files/Cinelli%20and%20Hazlett%20(2020)%20-%20Making%20Sense%20of%20Sensitivity.pdf">PDF</a></li>
<li>Chernozhukov et al.&nbsp;(2021), <em>Long Story Short: Omitted Variable Bias in Causal Machine Learning</em>. Extends OVB to ML. <a href="https://arxiv.org/html/2112.13398v5">arXiv</a></li>
<li>DoubleML, <a href="https://docs.doubleml.org/stable/guide/sensitivity.html">sensitivity analysis documentation</a>.</li>
<li><a href="https://en.wikipedia.org/wiki/Omitted-variable_bias">Omitted-variable bias</a> on Wikipedia for the algebra.</li>
<li>Eggers, <a href="https://andy.egge.rs/teaching/oss/ici_2019/eggers_ICI_slides_v1.pdf">Intermediate Causal Inference</a> lecture slides.</li>
</ul>
<hr>
<p><strong>More causal inference notes:</strong></p>
<ul>
<li><a href="../causal-inference-cheatsheet/">Causal inference cheatsheet</a></li>
<li><a href="../causal-inference-assessing-overlap-in-covariate-distributions/">Assessing overlap in covariate distributions</a></li>
<li><a href="../recommendations-as-treatments/">Recommendations as treatments</a></li>
</ul>


</section>

 ]]></description>
  <category>CausalInference</category>
  <guid>https://bilguunbatsaikhan.com/posts/gemini-deep-research-summary-of-the-state-of-ovb/</guid>
  <pubDate>Sat, 05 Apr 2025 15:00:00 GMT</pubDate>
</item>
<item>
  <title>When Linear Regression Gets Massively Confused</title>
  <link>https://bilguunbatsaikhan.com/posts/when-linear-regression-gets-massively-confused/</link>
  <description><![CDATA[ 




<p>This one is about a failure mode I keep running into: a <strong>mass point</strong>, a large cluster of identical values in your data, quietly breaking linear regression. It often shows up right after a log transform, and the symptoms are easy to misread.</p>
<hr>
<section id="what-a-mass-point-is" class="level3">
<h3 class="anchored" data-anchor-id="what-a-mass-point-is">What a mass point is</h3>
<p>Say a large number of your observations share the same value, for example a crowd of customers who all bought a $1 product. You log-transform sales to make the distribution more normal, and every one of those $1 values collapses to zero, since <img src="https://latex.codecogs.com/png.latex?%5Clog(1)%20=%200">. Now the transformed data is dominated by a spike of zeros.</p>
<p>Visually: the original data has a tall spike at <img src="https://latex.codecogs.com/png.latex?T=1">, and after the log transform that spike moves to <img src="https://latex.codecogs.com/png.latex?0"> and gets taller relative to everything else.</p>
<p><img src="https://bilguunbatsaikhan.com/images/2025/03/Screenshot-2025-03-27-at-15.41.21.png" class="img-fluid"></p>
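<p>The same effect on synthetic data, as a minimal sketch. The 40% share of $1 purchases and the log-normal spend for everyone else are assumptions made up for the illustration.</p>

```python
import math
import random
from collections import Counter

rng = random.Random(42)

# Hypothetical sales: about 40% of customers bought the $1 product,
# the rest spent $1 plus a log-normally distributed amount.
sales = [1.0 if 0.4 > rng.random() else 1.0 + rng.lognormvariate(3.0, 1.0)
         for _ in range(10000)]

log_sales = [math.log(s) for s in sales]

# Every $1 sale lands exactly on log(1) = 0, so zero becomes a mass point.
share_at_zero = sum(1 for v in log_sales if v == 0.0) / len(log_sales)
most_common_value, count = Counter(log_sales).most_common(1)[0]
print(round(share_at_zero, 2), most_common_value)
```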
<hr>
</section>
<section id="why-linear-models-struggle-with-it" class="level3">
<h3 class="anchored" data-anchor-id="why-linear-models-struggle-with-it">Why linear models struggle with it</h3>
<p>Linear regression leans on a few assumptions:</p>
<ol type="1">
<li><strong>Linearity</strong>: the relationship between predictors and outcome is linear.</li>
<li><strong>Independence of errors</strong>: residuals are independent of each other.</li>
<li><strong>Homoscedasticity</strong>: residuals have constant variance across all levels of the predictors.</li>
<li><strong>Normality</strong>: residuals are roughly normally distributed.</li>
</ol>
<p>A mass point mainly attacks the last two. With a big concentration of identical values (here, the spike at <img src="https://latex.codecogs.com/png.latex?%5Clog(1)%20=%200">), two things go wrong:</p>
<ul>
<li><strong>Non-normal residuals.</strong> The mass point pulls residuals toward itself, introducing asymmetry and distorting their spread, so they no longer sit balanced and normal around zero.</li>
<li><strong>Heteroscedasticity.</strong> Residual variance stops being constant, because the model can’t fit the dense region and the sparse region equally well. That matters because non-constant variance gives you biased standard errors (so t-tests and confidence intervals become unreliable) and inefficient estimates (predictions are less precise than they could be).</li>
</ul>
<p>The net effect: the model tries to fit a straight line through data that violates its own assumptions, and you get inefficient estimates with unreliable standard errors. On a scatter plot you can see the fitted line being pulled awkwardly toward the cluster at zero.</p>
<p><img src="https://bilguunbatsaikhan.com/images/2025/03/Screenshot-2025-03-27-at-15.42.41.png" class="img-fluid"></p>
<hr>
</section>
<section id="models-that-handle-it-better" class="level3">
<h3 class="anchored" data-anchor-id="models-that-handle-it-better">Models that handle it better</h3>
<p>Instead of forcing a single linear fit, use a model that explicitly accounts for the mass point. Three options, with tradeoffs:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
<col style="width: 20%">
</colgroup>
<thead>
<tr class="header">
<th>Model</th>
<th>Key idea</th>
<th>Pros</th>
<th>Cons</th>
<th>When to use</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Hurdle</strong></td>
<td>A classifier first decides whether an observation is in the mass-point group; a separate regression models the continuous values for the rest.</td>
<td>Cleanly separates the spike from the continuous part.</td>
<td>Two models to fit and coordinate.</td>
<td>A mix of a mass point (e.g.&nbsp;lots of <img src="https://latex.codecogs.com/png.latex?T=1"> or zeros) and continuous data.</td>
</tr>
<tr class="even">
<td><strong>Mixture</strong></td>
<td>Model the data as a blend of two distributions: one for the spike, one for the smooth continuous part.</td>
<td>Flexible, captures distinct subpopulations in one model.</td>
<td>Harder to fit; estimating the mixture components can be unstable.</td>
<td>Data that naturally splits into a sharp spike plus a smooth distribution.</td>
</tr>
<tr class="odd">
<td><strong>Zero-inflated</strong></td>
<td>A logistic component models excess zeros; a continuous model handles the rest.</td>
<td>Purpose-built for far more zeros than a standard model expects.</td>
<td>Best only when the mass point really is at zero; more setup.</td>
<td>An unusually high count of zeros (or one specific mass point).</td>
</tr>
</tbody>
</table>
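<p>A bare-bones sketch of the hurdle idea from the table, using only the standard library. Everything here is an assumption for illustration: the data-generating process, the one-feature logistic regression fitted by plain gradient descent, and the simple OLS for the continuous part. In practice you would reach for a proper statistics library.</p>

```python
import math
import random


def fit_logistic(x, labels, steps=500, lr=0.5):
    """Part 1: one-feature logistic regression fitted by plain gradient descent."""
    w, bias = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xi, li in zip(x, labels):
            p = 1.0 / (1.0 + math.exp(-(w * xi + bias)))
            grad_w += (p - li) * xi
            grad_b += p - li
        w -= lr * grad_w / n
        bias -= lr * grad_b / n
    return w, bias


def fit_ols(x, y):
    """Part 2: simple linear regression, returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


rng = random.Random(1)
x, y = [], []
for _ in range(1000):
    xi = rng.gauss(0, 1)
    p_continuous = 1.0 / (1.0 + math.exp(-(0.5 + 1.0 * xi)))   # chance of leaving the mass point
    if rng.random() > p_continuous:
        yi = 0.0                                   # mass point, e.g. log(1) = 0
    else:
        yi = 2.0 + 0.7 * xi + rng.gauss(0, 0.5)    # continuous part
    x.append(xi)
    y.append(yi)

# Part 1 learns who clears the hurdle; part 2 only sees observations off the mass point.
crossed = [0.0 if yi == 0.0 else 1.0 for yi in y]
w, bias = fit_logistic(x, crossed)
x_pos = [xi for xi, yi in zip(x, y) if yi != 0.0]
y_pos = [yi for yi in y if yi != 0.0]
slope, intercept = fit_ols(x_pos, y_pos)


def predict(xi):
    """Expected outcome: P(off the mass point) times the continuous-part prediction."""
    p_cross = 1.0 / (1.0 + math.exp(-(w * xi + bias)))
    return p_cross * (intercept + slope * xi)


print(round(w, 2), round(slope, 2), round(predict(0.0), 2))
```

<p>Splitting the fit this way means the regression never has to explain the spike, and the classifier never has to explain the spread of the continuous values.</p>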
<hr>
</section>
<section id="takeaway" class="level3">
<h3 class="anchored" data-anchor-id="takeaway">Takeaway</h3>
<p>A giant spike at zero in log-transformed data is a signal, not noise to ignore. Linear regression will fit it, but the standard errors and efficiency you rely on quietly stop being trustworthy. A hurdle, mixture, or zero-inflated model fits the structure of the data instead of fighting it.</p>
<hr>
</section>
<section id="appendix-why-heteroscedasticity-matters" class="level3">
<h3 class="anchored" data-anchor-id="appendix-why-heteroscedasticity-matters">Appendix: why heteroscedasticity matters</h3>
<p><strong>Effect on the estimates</strong></p>
<ol type="1">
<li><strong>Biased standard errors.</strong> Under heteroscedasticity the standard errors of the coefficients are unreliable, which throws off confidence intervals and hypothesis tests and raises the chance of Type I or Type II errors.</li>
<li><strong>Inefficient estimates.</strong> OLS stays unbiased, but it is no longer the best linear unbiased estimator (BLUE). The estimates have higher variance than necessary, so they are less precise.</li>
</ol>
<p><strong>Why efficiency is worth caring about</strong></p>
<p>Efficiency means getting estimators with the smallest possible variance. It buys you two things:</p>
<ol type="1">
<li><strong>Precision.</strong> Tighter confidence intervals, so the estimates are more reliable.</li>
<li><strong>Statistical power.</strong> More precision means tests are likelier to detect real effects, lowering the risk of Type II errors.</li>
</ol>
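<p>To make the first point concrete, here is a minimal simulation sketch that is not part of the original analysis. It assumes a made-up data-generating process in which the noise standard deviation grows with <code>x</code>, and compares the classical OLS standard errors with HC0 (White) heteroscedasticity-robust ones computed by hand with NumPy. The function name <code>ols_with_standard_errors</code> and all numbers are illustrative assumptions.</p>

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Return OLS coefficients, classical SEs, and HC0 (White) robust SEs."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Classical SEs assume one common error variance for every observation.
    sigma2 = resid @ resid / (n - k)
    classical_se = np.sqrt(np.diag(sigma2 * xtx_inv))
    # HC0 "sandwich" SEs use each observation's own squared residual instead.
    meat = X.T @ (X * resid[:, None] ** 2)
    robust_se = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, classical_se, robust_se

# Hypothetical data: the noise standard deviation is proportional to x.
rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * x
X = np.column_stack([np.ones(n), x])

beta, classical_se, robust_se = ols_with_standard_errors(X, y)
print("slope estimate:", beta[1])
print("classical SE:  ", classical_se[1])
print("robust SE:     ", robust_se[1])
```

<p>In this setup the slope estimate stays close to the true value of 2, while the robust standard error of the slope comes out larger than the classical one. That gap is the sense in which the classical standard errors stop being trustworthy.</p>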


</section>

 ]]></description>
  <category>DataScienceBasics</category>
  <category>Know-how</category>
  <category>regression</category>
  <guid>https://bilguunbatsaikhan.com/posts/when-linear-regression-gets-massively-confused/</guid>
  <pubDate>Fri, 14 Mar 2025 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Causal Inference: Assessing Overlap in Covariate Distributions</title>
  <link>https://bilguunbatsaikhan.com/posts/causal-inference-assessing-overlap-in-covariate-distributions/</link>
  <description><![CDATA[ 




<p><em>This post summarizes my notes on Chapter 14 of the book “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction” by Guido W. Imbens and Donald B. Rubin.</em></p>
<p>A few shorthands I use:</p>
<ul>
<li>T - shorthand for the treatment group</li>
<li>C - shorthand for the control group</li>
</ul>
<p>Covered sections and their summary:</p>
<ul>
<li>14.1 Introduces why adjusting for differences in the covariate distributions is needed</li>
<li>14.2 Goes over a case with two univariate distributions</li>
<li>14.3 Goes over a case with two multivariate distributions</li>
<li>14.4 Explains the role of propensity scores in assessing the overlap of covariate distributions</li>
<li>14.5 Provides a measure to assess if each treatment/control unit has an identical non-treated/treated twin</li>
<li>14.6 Provides examples (skipped)</li>
</ul>
<hr>
<section id="why-do-we-need-to-adjust-for-differences-in-the-covariate-distributions" class="level3">
<h3 class="anchored" data-anchor-id="why-do-we-need-to-adjust-for-differences-in-the-covariate-distributions">1. Why do we need to adjust for differences in the covariate distributions?</h3>
<p>If there is a region of the covariate space where either T or C has relatively few units, our inferences in that region will depend largely on extrapolation, and will therefore be less credible than inferences in regions where the covariate distributions of T and C overlap substantially.<br>
<br>
Example:<br>
<em>Covariate space #1:</em></p>
<ul>
<li>T has 5 males with age&lt;=18</li>
<li>C has 55 males with age&lt;=18</li>
</ul>
<p><em>Covariate space #2:</em></p>
<ul>
<li>T has 55 males with age&gt;18 and age&lt;=30</li>
<li>C has 60 males with age&gt;18 and age&lt;=30</li>
</ul>
<p>In the above example, our inferences for the covariate space #2 will be more credible than our inferences for the covariate space #1.<br>
<br>
Note that this is a fundamental issue even when there is no confounding (i.e.&nbsp;unconfoundedness holds). In a completely randomized experiment, however, the covariate distributions of T and C are similar in expectation by design, so we are less likely to run into it.</p>
<hr>
</section>
<section id="assessing-overlap-of-univariate-distributions" class="level3">
<h3 class="anchored" data-anchor-id="assessing-overlap-of-univariate-distributions">2. Assessing overlap of univariate distributions</h3>
<p><strong><em>1. Comparing differences in location of two distributions</em></strong><br>
<br>
Given two univariate probability distributions <img src="https://latex.codecogs.com/png.latex?f_c(x)"> and <img src="https://latex.codecogs.com/png.latex?f_t(x)"> with means <img src="https://latex.codecogs.com/png.latex?%5Cmu_c"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_t">, and variances <img src="https://latex.codecogs.com/png.latex?%5Csigma_c%5E2"> and <img src="https://latex.codecogs.com/png.latex?%5Csigma_t%5E2">, we can estimate the differences in locations of the two distributions with:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta(ct)%20=%20(%5Cmu_t%20-%20%5Cmu_c)%20%5Cbig/%20%5Csqrt%7B(%5Csigma_t%5E2%20+%20%5Csigma_c%5E2)/2%7D%0A"></p>
<p>This is a normalized, scale-free measure of the difference in location between the two distributions. Note that it is different from the t-statistic, which tests whether the data contain enough information to conclude that the covariate means in T and C differ:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AT(ct)%20=%20(%5Cmu_t%20-%20%5Cmu_c)%20%5Cbig/%20%5Cleft(%20%5Csqrt%7B%5Csigma_c%5E2/N_c%20+%20%5Csigma_t%5E2/N_t%7D%20%5Cright)%0A"></p>
<p>For our purposes, we don’t care about testing whether we have <u>enough data to check if means are different</u> between T&amp;C. The t-statistic grows with the sample size even when the gap between the means stays fixed, whereas the normalized difference does not. What we care about is whether the differences between the distributions are large enough that we need to address them, and if so, what kind of adjustment method the problem at hand requires.<br>
<br>
<strong><em>2. Comparing measures of dispersion of two distributions</em></strong><br>
<br>
The log difference in the standard deviations helps to understand the difference in dispersion:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CGamma(ct)%20=%20%5Cln(%5Csigma_t/%5Csigma_c)%20=%20%5Cln(%5Csigma_t)%20-%20%5Cln(%5Csigma_c)%0A"></p>
<p>We take the log because it is typically closer to normally distributed than the simple difference or ratio of the standard deviations.<br>
<br>
<strong><em>3. Comparing the tail overlaps of two distributions</em></strong><br>
<br>
To understand the overlap of covariate distributions in the tail regions, we can calculate the fraction of treatment units that exist in the tails of the distribution of covariate values in the control.<br>
<br>
One method is to choose a tail boundary, e.g.&nbsp;<img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.05">, and calculate the probability mass of T’s covariate distribution outside the <img src="https://latex.codecogs.com/png.latex?%5Calpha/2"> and <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Calpha/2"> quantiles of C’s covariate distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_t%5E%7B%5Calpha%7D%20=%20F_t%5C!%5Cbig(F_c%5E%7B-1%7D(%5Calpha/2)%5Cbig)%20+%20%5CBig(1%20-%20F_t%5C!%5Cbig(F_c%5E%7B-1%7D(1%20-%20%5Calpha/2)%5Cbig)%5CBig)%0A"></p>
<p>Here, <img src="https://latex.codecogs.com/png.latex?F_t"> and <img src="https://latex.codecogs.com/png.latex?F_c"> are the CDFs of the covariate distributions of T and C (so <img src="https://latex.codecogs.com/png.latex?F_c%5E%7B-1%7D"> is the quantile function of the control distribution). We can do the same for C, calculating the probability mass of C’s covariate distribution outside the <img src="https://latex.codecogs.com/png.latex?%5Calpha/2"> and <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Calpha/2"> quantiles of T’s covariate distribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cpi_c%5E%7B%5Calpha%7D%20=%20F_c%5C!%5Cbig(F_t%5E%7B-1%7D(%5Calpha/2)%5Cbig)%20+%20%5CBig(1%20-%20F_c%5C!%5Cbig(F_t%5E%7B-1%7D(1%20-%20%5Calpha/2)%5Cbig)%5CBig)%0A"></p>
<p>The intuition here is that it is relatively easy to impute the missing outcome values of either T or C units in the dense regions of the distribution (e.g.&nbsp;near the mean of a normal distribution, where the majority of samples are concentrated), while it is harder in the tail regions, where only a few units exist. The following figure shows an example of two covariate distributions with different dispersion:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2024/09/overlap-embedded-01.png" class="img-fluid figure-img"></p>
<figcaption>Two covariate distributions with different dispersion</figcaption>
</figure>
</div>
<p>Last but not least, here’s how to interpret the values:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_c%5E%7B%5Calpha%7D%20=%20%5Cpi_t%5E%7B%5Calpha%7D%20=%20%5Calpha"> holds in expectation for a completely randomized experiment, where only <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Ctimes%20100%5C%25"> of units have covariate values that make us cry (i.e.&nbsp;values for which it is difficult to impute the missing potential outcomes).</li>
<li>if <img src="https://latex.codecogs.com/png.latex?%5Cpi_t%5E%7B%5Calpha%7D%20%3E%20%5Calpha">, it will be relatively difficult to impute/predict the missing <strong>control</strong> outcomes, because more treatment units than expected sit in the tail region of the control group’s covariate distribution, where few control units exist to learn from.</li>
<li>if <img src="https://latex.codecogs.com/png.latex?%5Cpi_c%5E%7B%5Calpha%7D%20%3E%20%5Calpha">, it will be relatively difficult to impute/predict the missing <strong>treatment</strong> outcomes, because more control units than expected sit in the tail region of the treatment group’s covariate distribution, where few treatment units exist to learn from.</li>
</ul>
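<p>The three univariate checks above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions of my own: the two samples are simulated from made-up normal distributions, the population CDFs and quantile functions are replaced by their empirical counterparts, and the function names are hypothetical rather than taken from the book.</p>

```python
import numpy as np

def normalized_difference(x_t, x_c):
    """Delta: difference in means scaled by the average within-group variance."""
    pooled_var = (x_t.var(ddof=1) + x_c.var(ddof=1)) / 2
    return (x_t.mean() - x_c.mean()) / np.sqrt(pooled_var)

def t_statistic(x_t, x_c):
    """T: difference in means scaled by its standard error (grows with N)."""
    se = np.sqrt(x_c.var(ddof=1) / len(x_c) + x_t.var(ddof=1) / len(x_t))
    return (x_t.mean() - x_c.mean()) / se

def log_sd_ratio(x_t, x_c):
    """Gamma: log difference in standard deviations."""
    return np.log(x_t.std(ddof=1)) - np.log(x_c.std(ddof=1))

def tail_mass(x_a, x_b, alpha=0.05):
    """Share of x_a outside the alpha/2 and 1 - alpha/2 quantiles of x_b."""
    lower, upper = np.quantile(x_b, [alpha / 2, 1 - alpha / 2])
    return float(np.mean((lower > x_a) | (x_a > upper)))

rng = np.random.default_rng(0)
x_c = rng.normal(loc=0.0, scale=1.0, size=5_000)  # hypothetical control covariate
x_t = rng.normal(loc=0.5, scale=1.5, size=5_000)  # hypothetical treated covariate

delta = normalized_difference(x_t, x_c)
t_stat = t_statistic(x_t, x_c)
gamma = log_sd_ratio(x_t, x_c)
pi_t = tail_mass(x_t, x_c)  # treated mass in the tails of the control distribution
pi_c = tail_mass(x_c, x_t)  # control mass in the tails of the treated distribution

print(f"normalized difference: {delta:.3f}")
print(f"t-statistic:           {t_stat:.1f}")
print(f"log SD ratio:          {gamma:.3f}")
print(f"pi_t: {pi_t:.3f}, pi_c: {pi_c:.3f}")
```

<p>With these simulated inputs the t-statistic is large simply because there are 5,000 units per group, while the normalized difference stays moderate. The wider treated distribution also pushes <code>pi_t</code> above 0.05 and leaves <code>pi_c</code> below it.</p>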
<hr>
</section>
<section id="assessing-overlap-of-multivariate-distributions" class="level3">
<h3 class="anchored" data-anchor-id="assessing-overlap-of-multivariate-distributions">3. Assessing overlap of multivariate distributions</h3>
<p>Given K covariates, we could compare the distributions of T &amp; C one covariate at a time using the approach above. A good way to start is with the covariates that you believe a priori to be highly associated with the outcome (i.e.&nbsp;making sure the important covariates are okay).</p>
<section id="using-mahalanobis-distance-to-compare-differences-of-multivariate-distributions" class="level4">
<h4 class="anchored" data-anchor-id="using-mahalanobis-distance-to-compare-differences-of-multivariate-distributions"><em>Using Mahalanobis distance to compare differences of multivariate distributions</em></h4>
<p>In addition to the above metrics for univariate distributions, there is a neat multivariate summary measure that captures the difference in location between two distributions, similar to the one in the first part of section 2. It leverages the Mahalanobis distance and takes as input the two K-dimensional vectors of covariate means <img src="https://latex.codecogs.com/png.latex?%5Cmu_c"> and <img src="https://latex.codecogs.com/png.latex?%5Cmu_t">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CDelta_%7Bct%7D%5E%7B%5Ctext%7Bmulti%7D%7D%20=%20%5Csqrt%7B(%5Cmu_t%20-%20%5Cmu_c)%5E%7B%5Ctop%7D%20%5C,%20%5Cbig((%5CSigma_c%20+%20%5CSigma_t)/2%5Cbig)%5E%7B-1%7D%20%5C,%20(%5Cmu_t%20-%20%5Cmu_c)%7D%0A"></p>
<p>, where</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5CSigma_c%20=%20%5Cfrac%7B1%7D%7BN_c%20-%201%7D%20%5Csum_%7Bi:W_i=0%7D%20(X_i%20-%20%5Cbar%7B%5Cmu%7D_c)(X_i%20-%20%5Cbar%7B%5Cmu%7D_c)%5E%7B%5Ctop%7D%0A%5Cquad%5Ctext%7Band%7D%5Cquad%0A%5CSigma_t%20=%20%5Cfrac%7B1%7D%7BN_t%20-%201%7D%20%5Csum_%7Bi:W_i=1%7D%20(X_i%20-%20%5Cbar%7B%5Cmu%7D_t)(X_i%20-%20%5Cbar%7B%5Cmu%7D_t)%5E%7B%5Ctop%7D%0A"></p>
<p>are the sample covariance matrices of the covariates in C &amp; T, respectively. Intuitively, a larger Mahalanobis distance tells us that the group means are further apart in multivariate space, measured relative to the spread and correlation of the covariates.</p>
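<p>A minimal NumPy sketch of this multivariate measure, assuming each group’s covariates are stored as an array with one row per unit and one column per covariate. The function name and the simulated data are my own illustrative assumptions.</p>

```python
import numpy as np

def multivariate_normalized_difference(X_t, X_c):
    """Mahalanobis-style distance between the covariate means of T and C."""
    diff = X_t.mean(axis=0) - X_c.mean(axis=0)
    # np.cov divides by N - 1 by default, matching the formulas above.
    sigma_t = np.cov(X_t, rowvar=False)
    sigma_c = np.cov(X_c, rowvar=False)
    pooled = (sigma_c + sigma_t) / 2
    # Solve pooled @ z = diff instead of forming the inverse explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

# Hypothetical data: K = 3 covariates, groups differ only in the first mean.
rng = np.random.default_rng(0)
X_c = rng.normal(size=(5_000, 3))
X_t = rng.normal(size=(5_000, 3)) + np.array([0.5, 0.0, 0.0])

distance = multivariate_normalized_difference(X_t, X_c)
print(f"multivariate normalized difference: {distance:.3f}")
```

<p>Because the simulated covariates are uncorrelated with unit variance, the result lands near the 0.5 shift in the first covariate. With correlated covariates the pooled covariance matrix would rescale that gap accordingly.</p>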
</section>
<section id="using-propensity-scores-to-compare-distributions" class="level4">
<h4 class="anchored" data-anchor-id="using-propensity-scores-to-compare-distributions"><em>Using propensity scores to compare distributions</em></h4>
<p>The reason we can use propensity scores to assess the balance of covariate distributions is that any difference in covariate distributions also shows up as a difference in propensity score distributions. In principle, a difference in covariate distributions leads to a difference in the expected (i.e.&nbsp;average) propensity scores of T&amp;C. Conversely, a non-zero difference in the propensity score distributions of T&amp;C implies a difference in their covariate distributions.<br>
<br>
Since it is much less work to compare univariate distributions than multivariate ones, working with the propensity score, which has a univariate distribution, makes things a bit simpler.<br>
<br>
How does it work? Consider t(x) to be the true propensity score (assuming we know the treatment assignment mechanism) and l(x) to be the linearized propensity score (the log odds) of being in T vs C given covariates x:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Al(x)%20=%20%5Cln%5C!%5Cleft(%5Cfrac%7Bt(x)%7D%7B1%20-%20t(x)%7D%5Cright)%0A"></p>
<p>Then, we can simply look at the normalized difference in the mean linearized propensity scores of the two treatment groups, where:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cbar%7Bl%7D_c%20=%20%5Cfrac%7B1%7D%7BN_c%7D%20%5Csum_%7Bi:W_i=0%7D%20l(X_i)%0A%5Cquad%5Ctext%7Band%7D%5Cquad%0A%5Cbar%7Bl%7D_t%20=%20%5Cfrac%7B1%7D%7BN_t%7D%20%5Csum_%7Bi:W_i=1%7D%20l(X_i)%0A"></p>
<p>are the average linearized propensity scores of the C &amp; T units, and:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0As_%7Bl,c%7D%5E2%20=%20%5Cfrac%7B1%7D%7BN_c%20-%201%7D%20%5Csum_%7Bi:W_i=0%7D%20%5Cbig(l(X_i)%20-%20%5Cbar%7Bl%7D_c%5Cbig)%5E2%0A%5Cquad%5Ctext%7Band%7D%5Cquad%0As_%7Bl,t%7D%5E2%20=%20%5Cfrac%7B1%7D%7BN_t%20-%201%7D%20%5Csum_%7Bi:W_i=1%7D%20%5Cbig(l(X_i)%20-%20%5Cbar%7Bl%7D_t%5Cbig)%5E2%0A"></p>
<p>are the within-group sample variances of the linearized propensity score. This leads us to the estimated difference in average linearized propensity scores, scaled by the square root of the average of the two within-group variances:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Chat%7B%5CDelta%7D_%7Bct%7D%5E%7Bl%7D%20=%20(%5Cbar%7Bl%7D_t%20-%20%5Cbar%7Bl%7D_c)%20%5Cbig/%20%5Csqrt%7B(s_%7Bl,c%7D%5E2%20+%20s_%7Bl,t%7D%5E2)/2%7D%0A"></p>
<p>Note that the propensity score, unlike a raw covariate, is scale-invariant (it does not depend on the units in which the covariates are measured), so normalizing by the standard deviations matters less here than it does for individual covariates.<br>
<br>
Moreover, compared to how we assessed the univariate covariate distributions, there are some points to be aware of:</p>
<ol type="1">
<li>Differences in the covariate distributions are implied by the variation in the propensity scores.</li>
<li>If the treatment assignment mechanism is biased in some way, then it is possible that the covariate distributions are similar, but we observe a difference in the propensity score distributions.</li>
<li>On the other hand, the treatment assignment mechanism might not be biased, and the covariate distributions do indeed differ (implied by the difference in propensity score distributions).</li>
<li>If the covariate distributions of the two treatment groups differ, then the expected value of the propensity score in T must be larger than the expected value of the propensity score in C.</li>
</ol>
<p>As a result, differences in the covariate distributions of T vs C <u>imply, and are implied by</u>, differences in the average value of the propensity scores of T &amp; C. The proof is in Section 14.4 of the book.</p>
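<p>Here is a small sketch of the normalized difference in linearized propensity scores. It assumes, as in the text, that the true propensity score is known. The logistic assignment mechanism, its coefficients, and the function names are hypothetical choices made only for illustration.</p>

```python
import numpy as np

def linearized_propensity(p):
    """l(x): log odds of being treated, given a propensity score p = t(x)."""
    return np.log(p / (1 - p))

def normalized_difference_lps(l, w):
    """Difference in mean l(x) between T and C, scaled by the pooled SD."""
    l_t, l_c = l[w == 1], l[w == 0]
    pooled_var = (l_c.var(ddof=1) + l_t.var(ddof=1)) / 2
    return float((l_t.mean() - l_c.mean()) / np.sqrt(pooled_var))

# Hypothetical assignment mechanism: treatment gets likelier as x increases.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))  # assumed true propensity score t(x)
w = rng.binomial(1, p)                   # realized treatment indicator W_i

l = linearized_propensity(p)
delta_l = normalized_difference_lps(l, w)
print(f"normalized difference in l(x): {delta_l:.3f}")
```

<p>Since treated units are drawn with higher probability at larger <code>x</code>, the treated group ends up with a higher average linearized propensity score, so the normalized difference comes out positive.</p>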
<hr>
</section>
</section>
<section id="estimating-if-tc-units-have-comparable-twins-in-ct" class="level3">
<h3 class="anchored" data-anchor-id="estimating-if-tc-units-have-comparable-twins-in-ct">4. Estimating if T/C units have comparable twins in C/T</h3>
<p>Consider a unit <img src="https://latex.codecogs.com/png.latex?i"> with treatment <img src="https://latex.codecogs.com/png.latex?W_i">. For this unit <img src="https://latex.codecogs.com/png.latex?i"> we can determine if there is a comparable twin with the opposite treatment <img src="https://latex.codecogs.com/png.latex?%5Chat%7BW%7D_i%20=%201%20-%20W_i"> such that the absolute difference in linearized propensity scores <img src="https://latex.codecogs.com/png.latex?%7Cl(X_i)%20-%20l(%5Cbar%7BX%7D_i)%7C"> is less than or equal to <img src="https://latex.codecogs.com/png.latex?l_u">, where <img src="https://latex.codecogs.com/png.latex?%5Cbar%7BX%7D_i"> denotes the twin’s covariates and <img src="https://latex.codecogs.com/png.latex?l_u"> is an upper threshold implying the difference in propensity scores is less than 10%.<br>
<br>
Intuitively, if there is a unit with the opposite treatment and a similar propensity score, we may be able to obtain trustworthy estimates of causal effects without any extrapolation or additional work. However, if many units lack such twins, it will be difficult to obtain credible estimates of causal effects, no matter what methods we use. A common remedy, given enough samples, is to trim the units without twins.<br>
<br>
We can quantify how many units have close comparison units by defining an indicator variable <img src="https://latex.codecogs.com/png.latex?%5Cvarsigma_i"> that flags whether unit <img src="https://latex.codecogs.com/png.latex?i"> has a comparable twin:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cvarsigma_i%20=%0A%5Cbegin%7Bcases%7D%0A1%20&amp;%20%5Ctext%7Bif%20a%20comparable%20twin%20exists%20for%20unit%20%7D%20i,%5C%5C%0A0%20&amp;%20%5Ctext%7Botherwise.%7D%0A%5Cend%7Bcases%7D%0A"></p>
<p>Then, for C and T respectively, we can define two overlap measures giving the proportion of units with close comparisons:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0Aq_c%20=%20%5Cfrac%7B1%7D%7BN_c%7D%20%5Csum_%7Bi:W_i=0%7D%20%5Cvarsigma_i%0A%5Cquad%5Ctext%7Band%7D%5Cquad%0Aq_t%20=%20%5Cfrac%7B1%7D%7BN_t%7D%20%5Csum_%7Bi:W_i=1%7D%20%5Cvarsigma_i%0A"></p>
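<p>The twin indicator and the two overlap measures can be sketched as follows. This is an illustration under assumptions: the five linearized propensity scores are made up, the default threshold <code>l_u=0.1</code> is a placeholder you would set yourself, and the function names are hypothetical. The pairwise comparison uses broadcasting, which is simple but needs memory proportional to the number of treated units times the number of control units.</p>

```python
import numpy as np

def has_comparable_twin(l, w, l_u=0.1):
    """Flag units that have an opposite-group unit within l_u in l(x)."""
    l = np.asarray(l, dtype=float)
    w = np.asarray(w)
    flags = np.zeros(len(l), dtype=bool)
    for group in (0, 1):
        own = w == group
        # Pairwise |l_i - l_j| between this group and the opposite group.
        gaps = np.abs(l[own][:, None] - l[~own][None, :])
        flags[own] = l_u >= gaps.min(axis=1)
    return flags

def overlap_shares(l, w, l_u=0.1):
    """Return (q_c, q_t): the share of C and T units with a comparable twin."""
    flags = has_comparable_twin(l, w, l_u)
    w = np.asarray(w)
    return float(flags[w == 0].mean()), float(flags[w == 1].mean())

# Hypothetical linearized propensity scores: three treated, two control units.
l = [0.00, 0.05, 2.00, 0.02, 3.00]
w = [1, 1, 1, 0, 0]

q_c, q_t = overlap_shares(l, w)
print(f"q_c = {q_c:.2f}, q_t = {q_t:.2f}")
```

<p>In this toy input the treated unit at 2.00 and the control unit at 3.00 have no twin within the threshold, so one of the two control units and two of the three treated units are flagged as having a close comparison.</p>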
<hr>
<p><strong>More causal inference notes:</strong></p>
<ul>
<li><a href="../causal-inference-cheatsheet/">Causal inference cheatsheet</a></li>
<li><a href="../gemini-deep-research-summary-of-the-state-of-ovb/">Estimating the distribution of omitted variable bias</a></li>
<li><a href="../recommendations-as-treatments/">Recommendations as treatments</a></li>
</ul>


</section>

 ]]></description>
  <category>CausalInference</category>
  <category>DataScience</category>
  <category>DataScienceBasics</category>
  <category>howto</category>
  <guid>https://bilguunbatsaikhan.com/posts/causal-inference-assessing-overlap-in-covariate-distributions/</guid>
  <pubDate>Sat, 31 Aug 2024 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Recommendations as treatments</title>
  <link>https://bilguunbatsaikhan.com/posts/recommendations-as-treatments/</link>
  <description><![CDATA[ 




<p>This is a summary of “Recommendations as Treatments” (<a href="https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/18141/18875">paper</a>) by Joachims et al.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2024/06/Screenshot-2024-06-17-at-13.26.47.png" class="img-fluid figure-img"></p>
<figcaption>Image created by Dall-e: “Humorous illustration of Inverse Propensity Weighting. It features a quirky professor explaining the concept to amused students in a classroom, with a seesaw representing the ‘Treated’ and ‘Untreated’ groups.”</figcaption>
</figure>
</div>
<p>The paper’s main proposal is to view recommender systems as policies that decide which interventions to make in order to optimize a desired outcome. This opens the door to applying causal inference techniques, such as Inverse Propensity Weighting (IPW), to recommender systems.</p>
<hr>
<p><strong>Offline A/B testing or Off-policy evaluation of recommender systems</strong></p>
<p>The idea is to <strong>use historical data to evaluate recommender systems and obtain unbiased estimates</strong> without resorting to online A/B tests, which can be slow, expensive, and risky in terms of hurting the customer experience (CX).</p>
<p>If we can evaluate different recommendation policies offline (i.e.&nbsp;which policy would have performed best, if we had used it instead of the policy that logged the data), we will be able to find optimal policies, i.e.&nbsp;innovate faster.</p>
<hr>
<p><strong>Problem with offline A/B tests</strong></p>
<p>Suppose we have two recommendation policies/models: a) the logging policy (the current production model) and b) the target policy (the new policy we would like to test). Since the historical evaluation data was produced by the logging policy, its distribution follows whatever the logging policy has learned to output. Evaluating the target policy on this data will therefore be biased toward whatever the logging policy favors.</p>
<p>For example:</p>
<blockquote class="blockquote">
<p>Consider a simple movie recommender for a video streaming service. Imagine that, at each visit, the recommender selects a single movie to suggest to the users after they log in. Assume that we have collected data from this recommender using a stochastic policy whose distribution is weighted towards more popular movies, such as blockbuster superhero movies. When we use these data offline to evaluate new policies, we may mistakenly conclude that policies that favor superhero movies are better, since these movies are over-represented. <strong>This is the so-called ‘‘rich-get-richer” effect, wherein things that are already very successful become more so due to their ubiquity. At the same time, we may miss opportunities to provide more personalized recommendations due to insufficient data in certain niches</strong>. For instance, consider a user whose favorite genre is Scandinavian thrillers. They might also appreciate superhero movies, so recommending them an installment from the Avengers franchise is a safe bet. Yet that would be suboptimal; they would much prefer Midsommar, a horror film set in Sweden.</p>
</blockquote>
<p>Another way to put it is to understand that <strong>both the actions and the rewards are biased by the production policy</strong>.</p>
<p><u>1. Actions in the logged data are biased because:</u></p>
<ul>
<li>Production Policy’s Influence: The production policy (the system currently in use) determines which actions (e.g., recommendations) are made. If this policy has a preference for certain actions (like recommending popular movies), the data collected will be biased towards these actions.</li>
<li>Skewed Data: This means the observed actions in your data are not a random sample but are influenced by the production policy’s preferences. Therefore, some actions are overrepresented while others are underrepresented or not represented at all.</li>
</ul>
<p><u>2. Rewards in the logged data are biased because:</u></p>
<ul>
<li>Dependent on Actions: The rewards (e.g., clicks, purchases) are outcomes that result from the actions taken. Since the actions themselves are biased, the observed rewards are also biased. Popular items might receive more interactions simply because they are recommended more often, not necessarily because they are inherently better.</li>
<li>Conditional Bias: The rewards are conditional on the biased actions. For instance, the reward distribution for popular items might appear artificially inflated because these items are seen more frequently due to the production policy’s bias.</li>
</ul>
<hr>
<p><strong>Solution using Inverse Propensity Weighting</strong></p>
<p>In short, since the logged rewards are produced by the current production model, we can debias them by using the inverse propensity weights of the production model.</p>
<p>Let’s consider an example:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?x"> is the contextual features</li>
<li><img src="https://latex.codecogs.com/png.latex?a"> is the action/recommendation by the policy. For the sake of explanation, let’s say we have 3 possible actions: <img src="https://latex.codecogs.com/png.latex?a_A">, <img src="https://latex.codecogs.com/png.latex?a_B">, <img src="https://latex.codecogs.com/png.latex?a_C"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi"> is the production policy. In this example, <img src="https://latex.codecogs.com/png.latex?%5Cpi">’s propensity scores for the 3 possible actions are <img src="https://latex.codecogs.com/png.latex?0.75">, <img src="https://latex.codecogs.com/png.latex?0.15">, and <img src="https://latex.codecogs.com/png.latex?0.10"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> is the new policy. In this example, <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D">’s propensity scores for the 3 possible actions are <img src="https://latex.codecogs.com/png.latex?0.30">, <img src="https://latex.codecogs.com/png.latex?0.60">, and <img src="https://latex.codecogs.com/png.latex?0.10"></li>
<li><img src="https://latex.codecogs.com/png.latex?n"> is the total number of samples</li>
<li><img src="https://latex.codecogs.com/png.latex?r"> is the reward from the action</li>
</ul>
<p><strong>Case #1:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi"> recommends <img src="https://latex.codecogs.com/png.latex?a_A">, and the user likes the recommendation and clicks on it.</p>
<p>The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi"> will be 1 (correct recommendations / total recommendations <img src="https://latex.codecogs.com/png.latex?=%201/1%20=%201">). The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> will be:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cunderbrace%7B0.30%7D_%7B%5Cpi_%7B%5Ctext%7Bnew%7D%7D(a_A)%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7B%5Cunderbrace%7B0.75%7D_%7B%5Cpi(a_A)%7D%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7Bn%7D%20=%200.30%20%5Ccdot%20%5Cfrac%7B1%7D%7B0.75%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7B1%7D%20=%200.4%0A"></p>
<p>Comparing the utilities, the production policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> (utility <img src="https://latex.codecogs.com/png.latex?=1">) looks better than the new policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> (utility <img src="https://latex.codecogs.com/png.latex?=0.4">).</p>
<p><strong>Case #2:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi"> recommends <img src="https://latex.codecogs.com/png.latex?a_A"> and the user clicks it, but this time suppose <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> also assigns its highest propensity to <img src="https://latex.codecogs.com/png.latex?a_A">, at <img src="https://latex.codecogs.com/png.latex?0.9"> (instead of the 0.30 listed above).</p>
<p>The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi"> will be 1. The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> will be:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A0.90%20%5Ccdot%20%5Cfrac%7B1%7D%7B0.75%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7B1%7D%20=%201.2%0A"></p>
<p>Now the new policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> (utility <img src="https://latex.codecogs.com/png.latex?=1.2">) looks better than the production policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> (utility <img src="https://latex.codecogs.com/png.latex?=1">).</p>
<p><strong>Case #3:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi"> recommends <img src="https://latex.codecogs.com/png.latex?a_A">, but the user doesn’t like it and clicks <img src="https://latex.codecogs.com/png.latex?a_B"> instead.</p>
<p>The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi"> will be 0 (0 correct / 1 total). The utility score of <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> will be:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cunderbrace%7B0.60%7D_%7B%5Cpi_%7B%5Ctext%7Bnew%7D%7D(a_B)%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7B%5Cunderbrace%7B0.15%7D_%7B%5Cpi(a_B)%7D%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7Bn%7D%20=%200.60%20%5Ccdot%20%5Cfrac%7B1%7D%7B0.15%7D%20%5Ccdot%20%5Cfrac%7B1%7D%7B1%7D%20=%204%0A"></p>
<p>Comparing the utilities, the new policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%7B%5Ctext%7Bnew%7D%7D"> (utility <img src="https://latex.codecogs.com/png.latex?=4">) looks better than the production policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> (utility <img src="https://latex.codecogs.com/png.latex?=0">).</p>
<hr>
<p>In summary, the weighting mechanism of IPW helps us understand how a new policy would have performed by explicitly handling the biases that exist in the offline evaluation data produced by the existing policy (e.g.&nbsp;the rich-get-richer effect).</p>
<p>By weighting each observed interaction with the inverse propensity of the production policy, we are essentially placing more importance on samples with lower probability to be recommended by the production policy and placing lower importance on samples with higher probability.</p>
<p>As illustrated in the examples above, this helps us to understand if the new policy performs better than the existing policy.</p>
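<p>The arithmetic in the three cases can be reproduced with a short inverse-propensity-scoring sketch. The function name <code>ips_value</code> and the dictionary layout are my own assumptions, not code from the paper. Case #3 is encoded the way the calculation above treats it, i.e.&nbsp;as a logged interaction with action <code>a_B</code> and a reward of 1. For Case #2 only the propensity for <code>a_A</code> is given in the text, so the other two entries are left out.</p>

```python
import numpy as np

def ips_value(actions, rewards, pi_logging, pi_new):
    """Average of reward * pi_new(a) / pi_logging(a) over logged interactions."""
    weights = np.array([pi_new[a] / pi_logging[a] for a in actions])
    return float(np.mean(weights * np.asarray(rewards, dtype=float)))

# Propensities from the example above (context x is ignored for simplicity).
pi = {"a_A": 0.75, "a_B": 0.15, "a_C": 0.10}
pi_new = {"a_A": 0.30, "a_B": 0.60, "a_C": 0.10}

# Case #1: a_A is logged and clicked (reward 1).
case_1 = ips_value(["a_A"], [1], pi, pi_new)
# Case #2: same log, but the new policy puts 0.90 on a_A.
case_2 = ips_value(["a_A"], [1], pi, {"a_A": 0.90})
# Case #3: the click on a_B is treated as the logged interaction.
case_3 = ips_value(["a_B"], [1], pi, pi_new)

print(round(case_1, 2), round(case_2, 2), round(case_3, 2))
```

<p>With a single logged interaction these estimates are very noisy, and can even exceed 1. In practice the same average would be taken over many logged interactions, where rarely recommended actions receive large weights and frequently recommended ones receive small weights.</p>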
<p>This summary captures the main idea behind using Causal Inference techniques in the context of recommendation engines. I have left out several interesting gems presented in the original paper (e.g.&nbsp;counterfactual risk minimization, fairness of recommendation engines, etc.), so have a look at the paper if you are interested in learning more :).</p>
<hr>
<p><strong>More causal inference notes:</strong></p>
<ul>
<li><a href="../causal-inference-cheatsheet/">Causal inference cheatsheet</a></li>
<li><a href="../causal-inference-assessing-overlap-in-covariate-distributions/">Assessing overlap in covariate distributions</a></li>
<li><a href="../gemini-deep-research-summary-of-the-state-of-ovb/">Estimating the distribution of omitted variable bias</a></li>
</ul>



 ]]></description>
  <category>CausalInference</category>
  <category>Recommendations</category>
  <guid>https://bilguunbatsaikhan.com/posts/recommendations-as-treatments/</guid>
  <pubDate>Sun, 16 Jun 2024 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Causal Inference cheatsheet</title>
  <link>https://bilguunbatsaikhan.com/posts/causal-inference-cheatsheet/</link>
  <description><![CDATA[ 




<p>I keep forgetting which method assumes what, so I put together a single page I can scan before reaching for a technique. It follows the terminology in Matheus Facure’s <a href="https://matheusfacure.github.io/python-causality-handbook">Causal Inference for the Brave and True</a>, which is the resource I keep coming back to.</p>
<p>It is not exhaustive, and it is not a substitute for reading the book. It is the thing I glance at to remember what “regression discontinuity” actually buys me before I commit to it.</p>
<section id="core-concepts" class="level3">
<h3 class="anchored" data-anchor-id="core-concepts">Core concepts</h3>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Concept</th>
<th>What it means</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Causal Inference</strong></td>
<td>Determining the cause-and-effect relationship between variables.</td>
<td>The impact of a new drug on patient recovery.</td>
</tr>
<tr class="even">
<td><strong>Treatment / Intervention</strong></td>
<td>The variable or action being studied for its effect on an outcome.</td>
<td>A new teaching method.</td>
</tr>
<tr class="odd">
<td><strong>Outcome</strong></td>
<td>The variable that is influenced by the treatment.</td>
<td>Student test scores.</td>
</tr>
<tr class="even">
<td><strong>Confounder</strong></td>
<td>A variable that influences both the treatment and the outcome, biasing the estimated effect.</td>
<td>Age in a study linking exercise to heart health.</td>
</tr>
<tr class="odd">
<td><strong>Randomized Controlled Trial (RCT)</strong></td>
<td>Units are randomly assigned to treatment or control to ensure comparability. See <a href="https://matheusfacure.github.io/python-causality-handbook/02-Randomised-Experiments.html">the chapter</a>.</td>
<td>Randomly assigning patients to a drug or a placebo.</td>
</tr>
<tr class="even">
<td><strong>Observational Study</strong></td>
<td>You observe the effect of treatments without controlling assignment.</td>
<td>Smoking and lung cancer, studied from existing data.</td>
</tr>
<tr class="odd">
<td><strong>Counterfactual</strong></td>
<td>What would have happened to the same units under a different treatment.</td>
<td>The unemployment rate had the stimulus not passed.</td>
</tr>
<tr class="even">
<td><strong>Selection Bias</strong></td>
<td>Bias from studying subjects who are not representative of the population.</td>
<td>Only healthy volunteers enroll, inflating the drug’s apparent effect.</td>
</tr>
<tr class="odd">
<td><strong>Instrumental Variables (IV)</strong></td>
<td>A variable that affects treatment but only touches the outcome through it.</td>
<td>Distance to the nearest college as an instrument for education.</td>
</tr>
<tr class="even">
<td><strong>Difference-in-Differences (DiD)</strong></td>
<td>Compare outcome changes over time between a treatment and a control group.</td>
<td>A new law’s effect, comparing regions before and after.</td>
</tr>
<tr class="odd">
<td><strong>Regression Discontinuity (RD)</strong></td>
<td>Use a cutoff to assign treatment, compare units just above and below it.</td>
<td>A scholarship’s effect, comparing students around the eligibility line.</td>
</tr>
<tr class="even">
<td><strong>Propensity Score Matching</strong></td>
<td>Match treated and untreated units with similar probability of being treated.</td>
<td>Matching patients on demographics and clinical history before comparing.</td>
</tr>
<tr class="odd">
<td><strong>Synthetic Control</strong></td>
<td>Build a weighted blend of control units to stand in for the treated one. See <a href="https://matheusfacure.github.io/python-causality-handbook/15-Synthetic-Control.html">the chapter</a>.</td>
<td>A policy’s effect in one country vs a synthetic of several others.</td>
</tr>
<tr class="even">
<td><strong>Mediation Analysis</strong></td>
<td>How an intermediate variable carries the effect from cause to outcome.</td>
<td>Stress reduction mediating exercise and mental health.</td>
</tr>
<tr class="odd">
<td><strong>Natural Experiment</strong></td>
<td>A real-world event that mimics random assignment.</td>
<td>A natural disaster’s effect on economic outcomes.</td>
</tr>
<tr class="even">
<td><strong>Heterogeneous Treatment Effects</strong></td>
<td>How the effect varies across subgroups.</td>
<td>Whether a job-training program helps differently by age or education.</td>
</tr>
<tr class="odd">
<td><strong>Panel Data and Fixed Effects</strong></td>
<td>Repeated observations on the same units to absorb time-invariant confounders.</td>
<td>Education policy, tracked across years of student data.</td>
</tr>
<tr class="even">
<td><strong>Synthetic DiD (SDID)</strong></td>
<td>Combines synthetic control and DiD.</td>
<td>A law’s effect, treated region vs synthetic control over time.</td>
</tr>
</tbody>
</table>
</section>
<section id="the-assumptions-that-actually-bite" class="level3">
<h3 class="anchored" data-anchor-id="the-assumptions-that-actually-bite">The assumptions that actually bite</h3>
<p>Most causal estimates fall apart not in the math but in one of these three. They are usually untestable, which is exactly why they are worth writing down.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Assumption</th>
<th>What it means</th>
<th>Where it breaks</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>Ignorability / Exchangeability</strong></td>
<td>Conditional on observed covariates, treatment is as good as random.</td>
<td>An unmeasured confounder you never recorded.</td>
</tr>
<tr class="even">
<td><strong>SUTVA</strong></td>
<td>No interference between units, one version of treatment.</td>
<td>One person’s vaccination affecting another’s outcome.</td>
</tr>
<tr class="odd">
<td><strong>Common Support / Overlap</strong></td>
<td>Treatment and control overlap in covariate space.</td>
<td>A region where only treated units exist, forcing extrapolation.</td>
</tr>
</tbody>
</table>
</section>
<section id="picking-a-method-the-one-thing-that-matters-for-each" class="level3">
<h3 class="anchored" data-anchor-id="picking-a-method-the-one-thing-that-matters-for-each">Picking a method: the one thing that matters for each</h3>
<p>When I reach for one of these, the question is never “what does it do” but “what does it need to be true.” Here is the load-bearing assumption behind each, the thing that, if violated, quietly makes the estimate wrong.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 33%">
<col style="width: 33%">
<col style="width: 33%">
</colgroup>
<thead>
<tr class="header">
<th>Method</th>
<th>Use it when</th>
<th>The assumption it lives or dies on</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><strong>RCT</strong></td>
<td>You can assign treatment yourself.</td>
<td>Randomization actually held in practice (no broken blinding, no differential attrition).</td>
</tr>
<tr class="even">
<td><strong>IV</strong></td>
<td>Treatment is confounded but you have a clean instrument.</td>
<td>The instrument touches the outcome <em>only</em> through treatment. One alternate path kills it.</td>
</tr>
<tr class="odd">
<td><strong>DiD</strong></td>
<td>You have before/after data for a treated and a control group.</td>
<td>Parallel trends: the two groups would have moved together absent treatment.</td>
</tr>
<tr class="even">
<td><strong>RD</strong></td>
<td>Treatment flips at a sharp threshold.</td>
<td>Units can’t precisely manipulate which side of the cutoff they land on.</td>
</tr>
<tr class="odd">
<td><strong>Propensity Score Matching</strong></td>
<td>Selection is on observables you measured.</td>
<td>You actually observed the confounders. It can’t fix unobserved ones.</td>
</tr>
<tr class="even">
<td><strong>Synthetic Control</strong></td>
<td>One treated unit, many candidate controls, long pre-period.</td>
<td>The synthetic blend tracks the treated unit well <em>before</em> treatment.</td>
</tr>
<tr class="odd">
<td><strong>Mediation</strong></td>
<td>You want the pathway, not just the total effect.</td>
<td>No unmeasured confounding of the mediator-outcome link (the part people forget).</td>
</tr>
<tr class="even">
<td><strong>Panel / Fixed Effects</strong></td>
<td>Repeated measures, confounders that don’t change over time.</td>
<td>The confounders really are time-invariant.</td>
</tr>
<tr class="odd">
<td><strong>SDID</strong></td>
<td>DiD’s parallel-trends assumption looks shaky and you have a panel.</td>
<td>A weaker, reweighted version of parallel trends still holds.</td>
</tr>
</tbody>
</table>
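<p>To make one row of the table concrete, here is a minimal sketch of the classic 2x2 DiD estimate. The function name <code>did_estimate</code> and the toy numbers are made up for illustration, and the number it returns is only a causal effect if parallel trends holds.</p>

```python
from statistics import mean


def did_estimate(rows):
    """2x2 difference-in-differences from (treated, post, outcome) rows."""
    cells = {}
    for treated, post, outcome in rows:
        cells.setdefault((treated, post), []).append(outcome)
    avg = {cell: mean(values) for cell, values in cells.items()}
    treated_change = avg[(True, True)] - avg[(True, False)]
    control_change = avg[(False, True)] - avg[(False, False)]
    # The control group's change stands in for what the treated group
    # would have done without treatment (the parallel-trends assumption).
    return treated_change - control_change


# Hypothetical toy data: (treated, post, outcome)
rows = [
    (True, False, 10.0), (True, False, 12.0),
    (True, True, 17.0), (True, True, 19.0),
    (False, False, 8.0), (False, False, 10.0),
    (False, True, 11.0), (False, True, 13.0),
]
effect = did_estimate(rows)
print(effect)
```

<p>Here the treated group moves by 7 and the control group by 3, so the estimate is 4. In real work I would get the same number from a regression with an interaction term, which also gives standard errors.</p>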
</section>
<section id="things-ive-learned-to-check-before-trusting-a-result" class="level3">
<h3 class="anchored" data-anchor-id="things-ive-learned-to-check-before-trusting-a-result">Things I’ve learned to check before trusting a result</h3>
<ul>
<li>Plot the overlap. If treated and control don’t share covariate space, the model is extrapolating and nobody told you.</li>
<li>Stress the key assumption, not the data. More features won’t save a design whose identification is broken.</li>
<li>Run a sensitivity analysis. “How big would an unobserved confounder have to be to flip this?” is often more honest than the point estimate.</li>
<li>Be suspicious of a clean answer from messy observational data. It usually means an assumption is doing more work than you realize.</li>
</ul>
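<p>For the first bullet, a crude numeric check can sit alongside the plot. The sketch below assumes you already have a propensity score per unit (the scores here are invented) and uses a simple min/max comparison of the two groups, which is only one of several ways to define common support. The helper names are hypothetical.</p>

```python
def common_support(treated_scores, control_scores):
    """Range of propensity scores covered by both groups (crude min/max check)."""
    low = max(min(treated_scores), min(control_scores))
    high = min(max(treated_scores), max(control_scores))
    return low, high


def share_outside(scores, low, high):
    """Fraction of units whose score falls outside [low, high]."""
    outside = [s for s in scores if s < low or s > high]
    return len(outside) / len(scores)


# Hypothetical propensity scores, one per unit
treated_scores = [0.35, 0.50, 0.70, 0.95]
control_scores = [0.10, 0.30, 0.45, 0.60]

low, high = common_support(treated_scores, control_scores)
treated_off_support = share_outside(treated_scores, low, high)
control_off_support = share_outside(control_scores, low, high)
print(low, high, treated_off_support, control_off_support)
```

<p>In this toy case half of each group sits outside the shared range, which is the kind of result that should stop me before I fit anything.</p>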
<hr>
<p><strong>More causal inference notes:</strong></p>
<ul>
<li><a href="../causal-inference-assessing-overlap-in-covariate-distributions/">Assessing overlap in covariate distributions</a></li>
<li><a href="../gemini-deep-research-summary-of-the-state-of-ovb/">Estimating the distribution of omitted variable bias</a></li>
<li><a href="../recommendations-as-treatments/">Recommendations as treatments</a></li>
</ul>


</section>

 ]]></description>
  <category>CausalInference</category>
  <category>Cheatsheet</category>
  <guid>https://bilguunbatsaikhan.com/posts/causal-inference-cheatsheet/</guid>
  <pubDate>Sun, 14 Jan 2024 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Relationship of covariance and dot product</title>
  <link>https://bilguunbatsaikhan.com/posts/relationship-of-covariance-and-dot-product/</link>
  <description><![CDATA[ 




<section id="the-relationship-between-covariance-and-dot-product" class="level3">
<h3 class="anchored" data-anchor-id="the-relationship-between-covariance-and-dot-product">The Relationship Between Covariance and Dot Product</h3>
<p>Covariance and the dot product are related concepts in mathematics and statistics, particularly in the context of vectors and random variables. Here’s how they are connected:</p>
<section id="dot-product" class="level4">
<h4 class="anchored" data-anchor-id="dot-product">Dot Product</h4>
<p>The dot product (also known as the scalar product) of two vectors <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Ba%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7Bb%7D"> in <img src="https://latex.codecogs.com/png.latex?n">-dimensional space is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbf%7Ba%7D%20%5Ccdot%20%5Cmathbf%7Bb%7D%20=%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20a_i%20b_i%20.%0A"></p>
<p>This operation results in a single scalar value and provides a measure of the similarity between the two vectors. If the vectors point in the same direction, the dot product is positive and large; if they point in opposite directions, the dot product is negative; if they are orthogonal, the dot product is zero.</p>
</section>
<section id="covariance" class="level4">
<h4 class="anchored" data-anchor-id="covariance">Covariance</h4>
<p>Covariance is a measure of how much two random variables <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> change together. It is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D(X,%20Y)%20=%20%5Cmathbb%7BE%7D%5C!%5Cleft%5B(X%20-%20%5Cmathbb%7BE%7D%5BX%5D)(Y%20-%20%5Cmathbb%7BE%7D%5BY%5D)%5Cright%5D%0A"></p>
<p>Where <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D"> denotes the expected value. If the covariance is positive, the variables tend to increase together; if it is negative, one tends to increase when the other decreases.</p>
</section>
<section id="relationship" class="level4">
<h4 class="anchored" data-anchor-id="relationship">Relationship</h4>
<p>The relationship between covariance and the dot product can be seen in the context of centered random variables. If you consider the centered random variables:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Ctilde%7BX%7D%20=%20X%20-%20%5Cmathbb%7BE%7D%5BX%5D,%20%5Cqquad%20%5Ctilde%7BY%7D%20=%20Y%20-%20%5Cmathbb%7BE%7D%5BY%5D,%0A"></p>
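<p>the covariance between <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y">, estimated from <img src="https://latex.codecogs.com/png.latex?n"> paired observations, can be interpreted as the dot product of the two vectors of centered observations, normalized by the number of observations <img src="https://latex.codecogs.com/png.latex?n"> (the unbiased sample estimator divides by <img src="https://latex.codecogs.com/png.latex?n-1"> instead):</p>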
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Coperatorname%7BCov%7D(X,%20Y)%20=%20%5Cfrac%7B1%7D%7Bn%7D%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20%5Ctilde%7BX%7D_i%20%5Ctilde%7BY%7D_i%20.%0A"></p>
<p>Here, <img src="https://latex.codecogs.com/png.latex?%5Ctilde%7BX%7D_i"> and <img src="https://latex.codecogs.com/png.latex?%5Ctilde%7BY%7D_i"> are the deviations from the mean of the random variables <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> respectively. This sum is analogous to the dot product, where the sum of the products of corresponding elements gives a measure of the overall relationship between the two sets of values.</p>
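<p>A small numeric check of the identity above, as a sketch in plain Python. The helper names and the toy data are made up for illustration, and the covariance uses the 1/n normalization from the formula rather than the n-1 version.</p>

```python
def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def covariance(x, y):
    """Covariance as a dot product of centered vectors, normalized by n."""
    return dot(centered(x), centered(y)) / len(x)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]

cov_xy = covariance(x, y)

# Cross-check with the equivalent form E[XY] - E[X]E[Y]
n = len(x)
cov_alt = dot(x, y) / n - (sum(x) / n) * (sum(y) / n)
print(cov_xy, cov_alt)
```

<p>Both routes give 2.5 for this data, which is the dot product of the centered vectors (10) divided by the number of observations (4).</p>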
</section>
<section id="summary" class="level4">
<h4 class="anchored" data-anchor-id="summary">Summary</h4>
<p>In summary, covariance can be viewed as a normalized dot product of centered random variables. Both operations measure how two sets of values relate to each other, with the dot product being a geometric measure in vector space and covariance being a statistical measure in the space of random variables.</p>


</section>
</section>

 ]]></description>
  <category>DataScienceBasics</category>
  <guid>https://bilguunbatsaikhan.com/posts/relationship-of-covariance-and-dot-product/</guid>
  <pubDate>Mon, 31 Jan 2022 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Deep dive into MLOps.</title>
  <link>https://bilguunbatsaikhan.com/posts/deep-dive-into-mlops/</link>
  <description><![CDATA[ 




<section id="motivation" class="level3">
<h3 class="anchored" data-anchor-id="motivation">Motivation</h3>
<p>The Proof of Concept (POC) is one of the key phases of a Machine Learning (ML) project. Iteratively coming up with a better modeling approach, improving data quality, and finally delivering a minimum viable solution to a grand problem is fascinating, and you learn a lot.</p>
<p>At the end of the POC phase, if the results are satisfactory and stakeholders agree to move forward, the ML engineer is tasked with finding and implementing an appropriate method for deploying the model into production. Depending on the business use case, the scale of the project, and many other factors, he/she will usually choose one of the following (non-exhaustive) options:</p>
<ol type="1">
<li><strong>Type 1</strong>: Save the trained model artifacts in the backend. Implement an inference logic that uses the trained model. The frontend engineer then requests the predictions from the backend via HTTP.</li>
<li><strong>Type 2</strong>: Choose to use one of the cloud services (Amazon AWS, Google GCP, Microsoft Azure) for training the model. Transfer the training data to the cloud, train the model and deploy it in the cloud. Set up a cloud endpoint and return predictions via HTTP requests.</li>
<li><strong>Type 3 (worst)</strong>: Implement the training and the inference logic in the backend. At the same time, implement the data <strong>e</strong>xtraction, <strong>t</strong>ransformation, and <strong>l</strong>oading (ETL) processes in the backend so that the system can retrain the model when needed. Try his/her best to set up monitoring so that, when things go wrong one day, he/she can find and fix the problem.</li>
<li><strong>Type 4</strong>: Use cloud services to set up everything (ETL, data preprocessing, model training, monitoring, configuration, etc.). The frontend engineer then requests predictions via an HTTP endpoint provided by the cloud service.</li>
<li><strong>Type 5</strong>: Use open-source packages, connect and build everything manually.</li>
</ol>
<p>There can be many other options too. However, the key message here is that <strong>ML deployment can take on many forms.</strong></p>
<p>With ever-increasing data volumes and growing expertise among the people who work with that data, ML models are being trained, tested, and deployed into production at unprecedented rates. However, as the number of deployments grows, ML practitioners have been running into a set of problems tied to a property unique to ML-embedded systems: <strong>the usage of data</strong>. This is exactly where MLOps comes to the rescue. Similar to the concept/culture of <a href="https://en.wikipedia.org/wiki/DevOps">DevOps</a> in the traditional software engineering discipline, the term <strong>MLOps refers to the engineering culture and best practices related to deploying and maintaining real-world ML systems</strong>.</p>
<hr>
<p><strong>There are many resources related to MLOps.</strong> Some of my favorites are:</p>
<ol type="1">
<li><a href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">“Hidden Technical Debt in Machine Learning Systems”</a> - Google <strong>(Analyzed in this blog post.)</strong></li>
<li><a href="https://developers.google.com/machine-learning/guides">“Machine Learning Guides”</a> - Google</li>
<li><a href="https://www.youtube.com/watch?v=_xH7mlDGb0c">“MLOps for non-DevOps folks, a.k.a. “I have a model, now what?!”</a> - Hannes Hapke</li>
<li><a href="https://www.youtube.com/watch?v=06-AZXmwHjo">“From Model-centric to Data-centric AI”</a> - Andrew Ng</li>
<li><a href="https://www.youtube.com/watch?v=tsPuVAMaADY&amp;t=320s">“Bridging AI’s Proof-of-Concept to Production Gap”</a> - Andrew Ng</li>
<li><a href="https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning">“MLOps: Continuous delivery and automation pipelines in machine learning”</a> - Google</li>
<li><a href="https://www.youtube.com/watch?v=_jnhXzY1HCw">“What is ML Ops? Best Practices for DevOps for ML”</a> - Kaz Sato</li>
<li>…and so on.</li>
</ol>
<p>Perhaps the most popular resource on the above list is the paper called <a href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">“Hidden Technical Debt in Machine Learning Systems”</a> by Google.</p>
<p>When I first read the paper, I found it highly useful in practice and realized I would want to revisit it whenever I build ML solutions. So, to make my life a bit easier, I decided to summarize it in simple terms.</p>
<hr>
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-11.15.12.png" class="img-fluid"></p>
</section>
<section id="key-points" class="level3">
<h3 class="anchored" data-anchor-id="key-points"><strong>Key points:</strong></h3>
<ol type="1">
<li><strong>Developing and deploying ML systems might seem fast and cheap, but maintaining them over the long term is difficult and expensive.</strong></li>
<li><strong>The goal of dealing with technical debt is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability.</strong></li>
<li><strong>Data influences ML system behavior.</strong> <strong>Technical debt in ML systems may be difficult to detect because it exists at the system level rather than the code level.</strong></li>
</ol>
<hr>
</section>
<section id="complex-models-erode-boundaries" class="level3">
<h3 class="anchored" data-anchor-id="complex-models-erode-boundaries">Complex Models Erode Boundaries</h3>
<section id="entanglement" class="level4">
<h4 class="anchored" data-anchor-id="entanglement"><strong><u>1. &nbsp;Entanglement</u></strong></h4>
<ol type="1">
<li><strong>CACE principle:</strong><br>
<strong>CACE stands for Changing Anything Changes Everything.</strong> It refers to the entangled nature of ML systems, where changing the input distribution of a single feature can change the importance, weights, or use of all the remaining features. CACE applies not only to input signals but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.</li>
<li><strong>Possible solution #1:</strong><br>
<strong>If the problem can be decomposed into sub-problems (disjoint, uncorrelated), train models for each sub-problem and serve ensemble models.</strong> In many cases ensembles work well because the errors in the component models are uncorrelated. <u>However, ensemble models lead to strong entanglement: improving an individual model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components</u>.</li>
<li><strong>Possible solution #2:</strong><br>
<strong>Monitor and detect changes in the prediction behavior in real-time.</strong> Use visualization tools to analyze the changes.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-14.19.32.png" class="img-fluid figure-img"></p>
<figcaption>Example of feature entanglement: complex correlation of features.</figcaption>
</figure>
</div>
</section>
<section id="correction-cascades-chain-reaction" class="level4">
<h4 class="anchored" data-anchor-id="correction-cascades-chain-reaction"><strong><u>2. Correction Cascades (chain reaction)</u></strong></h4>
<ol type="1">
<li>When we have a ready-to-use <strong>model m<sub>a</sub></strong> for <strong>problem A</strong> and need to train a new model for a <strong>problem A<sup>’</sup></strong>, it makes sense to train a <strong>correction model m<sup>’</sup><sub>a</sub></strong> that takes as input the predictions of the <strong>model m<sub>a</sub></strong>.</li>
<li>This dependency can be cascaded further, for example training a <strong>model m<sup>’’</sup><sub>aa</sub></strong> based on the <strong>model m<sup>’</sup><sub>a</sub></strong>.</li>
<li>This dependency structure creates an <strong>improvement deadlock</strong>. When we try to improve the accuracy of a single model, the models that depend on it are affected by the change, causing system-level issues.</li>
<li><strong>Possible solution #1</strong>:<br>
Train and tune the <strong>model m<sub>a</sub></strong> directly for the <strong>problem A<sup>’</sup></strong> by adding appropriate features specific to the <strong>problem A<sup>’</sup></strong>.</li>
<li><strong>Possible solution #2</strong>:<br>
Create a new model for the <strong>problem A<sup>’</sup></strong>.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-12.21.47.png" class="img-fluid figure-img"></p>
<figcaption>Example of correction cascades.</figcaption>
</figure>
</div>
</section>
<section id="undeclared-consumers-please-let-us-know-if-you-are-using" class="level4">
<h4 class="anchored" data-anchor-id="undeclared-consumers-please-let-us-know-if-you-are-using"><u><strong>3. Undeclared Consumers</strong></u> <strong>(Please let us know if you are using!)</strong></h4>
<ol type="1">
<li>Often, <strong>predictions of a model are used in other services.</strong></li>
<li>If we don’t <strong>identify all consumers of a model</strong>, any later change to the model will <strong>silently affect the undeclared consumers, which then face strange issues</strong>. Without a clear definition of consumers, it becomes difficult and costly to make changes to the model at all.</li>
<li>In practice, <strong>engineers will choose to use the most accessible signal at hand</strong> (e.g.&nbsp;model predictions), especially when deadlines are approaching.</li>
<li><strong>Possible solution:</strong><br>
Set up <u>access restrictions</u> and <u>determine all consumers</u>. Setting up <a href="https://en.wikipedia.org/wiki/Service-level_agreement">service-level agreements (SLAs)</a> would also work.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-14.33.21.png" class="img-fluid figure-img"></p>
<figcaption>Example of an undeclared consumer.</figcaption>
</figure>
</div>
</section>
</section>
<section id="data-dependencies-cost-more-than-code-dependencies" class="level3">
<h3 class="anchored" data-anchor-id="data-dependencies-cost-more-than-code-dependencies">Data Dependencies Cost More than Code Dependencies</h3>
<section id="unstable-data-dependencies-can-i-rely-on-you" class="level4">
<h4 class="anchored" data-anchor-id="unstable-data-dependencies-can-i-rely-on-you"><u><strong>1. Unstable Data Dependencies</strong></u> <strong>(Can I rely on you?)</strong></h4>
<ol type="1">
<li><strong>ML-systems often consume data (input signals) produced by other systems.</strong></li>
<li><strong>Some data might be unstable</strong> (qualitatively or quantitatively changing over time).<br>
<u><strong>For example:</strong></u><br>

<ol type="a">
<li>Input data to <strong>system B</strong> is produced by a machine learning model from <strong>system A</strong>, and <strong>system A</strong> decides to update its model.<br>
</li>
<li>Input data comes from a data-dependent lookup table (e.g.&nbsp;one used to calculate TF/IDF scores or semantic mappings).<br>
</li>
<li><strong>Engineering ownership</strong> <strong>of the input signal is separate from the engineering ownership of the model that consumes it.</strong> Engineers who own the input signal can make changes to the data at any time. This makes it costly for the engineers who consume the data to analyze how the change affects their system.<br>
</li>
<li><strong>Corrections in the input data can lead to detrimental consequences too!</strong> Similar to the CACE principle.<br>
<u>For example:</u><br>
A model that is previously trained on <u>mis-calibrated input signals</u> can start to behave strangely when the input data is recalibrated. This means that although the update was <u>meant to fix a problem</u>, it actually <u>introduced complications</u> to the system.</li>
</ol></li>
<li><strong>Possible solution:</strong><br>
<strong>Create a versioned copy of the input data.</strong> Using a stable version of the data until the new data has been checked and tested keeps the system stable. <strong>Note:</strong> keeping versioned copies of the data means dealing with potential data staleness and the cost of maintaining multiple versions.</li>
</ol>
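<p>The versioning idea above can be sketched as a tiny in-memory store. This is only an illustration under my own assumptions: the class <code>VersionedSignal</code>, its methods, and the <code>score_scale</code> field are hypothetical, and a real setup would keep snapshots in durable storage rather than a dict.</p>

```python
class VersionedSignal:
    """Keeps every published snapshot of an upstream signal.

    Consumers always read the pinned version, so an upstream update
    cannot change their inputs until they explicitly promote it.
    """

    def __init__(self):
        self._snapshots = {}
        self.pinned = None

    def publish(self, version, data):
        # Copy so later edits by the producer do not leak into the snapshot.
        self._snapshots[version] = dict(data)
        if self.pinned is None:
            self.pinned = version

    def read(self):
        return self._snapshots[self.pinned]

    def promote(self, version, check):
        """Move the pin only if the snapshot passes the consumer's own check."""
        if check(self._snapshots[version]):
            self.pinned = version
            return True
        return False


store = VersionedSignal()
store.publish("v1", {"score_scale": 1.0})
store.publish("v2", {"score_scale": 100.0})  # upstream "fix" that rescales the signal

# The consumer's model was trained on scores in a small range.
accepted = store.promote("v2", lambda data: data["score_scale"] <= 10.0)
print(accepted, store.pinned, store.read())
```

<p>The consumer stays on <code>v1</code> until it has retrained or otherwise verified that <code>v2</code> is safe, which is exactly the staleness trade-off mentioned above.</p>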
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-16.27.59.png" class="img-fluid figure-img"></p>
<figcaption>Example of data dependency between systems.</figcaption>
</figure>
</div>
</section>
<section id="underutilized-data-dependencies-how-will-this-feature-affect-me-in-the-future" class="level4">
<h4 class="anchored" data-anchor-id="underutilized-data-dependencies-how-will-this-feature-affect-me-in-the-future"><u><strong>2. Underutilized Data Dependencies</strong></u> <strong>(How will this feature affect me in the future?)</strong></h4>
<ol type="1">
<li><strong>Underutilized data dependency: an input signal that provides very little incremental benefit to the model (e.g.&nbsp;a 0.0001% improvement)</strong>.</li>
<li><u><strong>Example:</strong></u><br>
<strong>A system introduces a new product numbering scheme.</strong> Because removing the old scheme right away would break the components that still depend on it, engineers decide to keep both schemes for the time being. As a result, <strong>new products have only the new product number</strong>, but <strong>old products have both the new and the old product numbers</strong>. The model is then retrained with both features and keeps relying on old product numbers for some products. A year later, <strong>when the old numbering code is deleted, it will be difficult for the maintainers to find out what went wrong.</strong></li>
<li><strong>Types of underutilized data dependency</strong>:<br>
<strong>a) Legacy Features.</strong><br>
A feature <strong>F</strong> is included in a model early in its development. Over time, <strong>F</strong> is made redundant by new features but this goes undetected.<br>
<strong>b) Bundled Features.</strong><br>
During an experiment, a group of features is evaluated and found to be useful. Because of deadline pressure, the whole group is added to the model together, possibly including features with little value.<br>
<strong>c) ε-Features.</strong> As machine learning researchers, we are tempted to improve model accuracy even when the gain is very small and the complexity overhead is high.<br>
<strong>d) Correlated Features.</strong> Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. When the correlations disappear in the future, the model will perform poorly.</li>
<li><strong>Possible solution:</strong><br>
<strong>Leave-one-feature-out</strong> evaluations can be used to detect underutilized dependencies. Regularly performing this evaluation is recommended.</li>
</ol>
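<p>A leave-one-feature-out evaluation can be sketched in a few lines. Everything here is a toy under my own assumptions: the nearest-centroid scorer, the feature names, and the data are made up, and the score is computed on the training rows for brevity where a real check would use held-out data and your actual model.</p>

```python
def squared_distance(row, centroid, columns):
    """Squared Euclidean distance between a row and a centroid over the given columns."""
    return sum((row[c] - centroid[c]) ** 2 for c in columns)


def nearest_centroid_accuracy(rows, labels, columns):
    """Toy scorer: classify each row by its nearest class centroid."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        centroids[label] = {c: sum(m[c] for m in members) / len(members) for c in columns}
    correct = 0
    for row, label in zip(rows, labels):
        predicted = min(centroids, key=lambda lab: squared_distance(row, centroids[lab], columns))
        if predicted == label:
            correct += 1
    return correct / len(rows)


def leave_one_feature_out(rows, labels, columns, score_fn):
    """Score drop caused by removing each feature in turn."""
    baseline = score_fn(rows, labels, columns)
    drops = {}
    for col in columns:
        remaining = [c for c in columns if c != col]
        drops[col] = baseline - score_fn(rows, labels, remaining)
    return baseline, drops


# Hypothetical data: only "clicks" separates the two classes.
rows = [
    {"clicks": 0.0, "legacy_score": 0.5, "hour": 0.2},
    {"clicks": 0.2, "legacy_score": 0.4, "hour": 0.8},
    {"clicks": 1.0, "legacy_score": 0.5, "hour": 0.2},
    {"clicks": 0.8, "legacy_score": 0.4, "hour": 0.8},
]
labels = [0, 0, 1, 1]

baseline, drops = leave_one_feature_out(
    rows, labels, ["clicks", "legacy_score", "hour"], nearest_centroid_accuracy
)
print(baseline, drops)
```

<p>Features whose removal costs nothing (here <code>legacy_score</code> and <code>hour</code>) are candidates for deletion. Two strongly correlated features can each show a zero drop even if one of them matters, so I treat the output as a list of things to investigate rather than a verdict.</p>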
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-18.11.04.png" class="img-fluid figure-img"></p>
<figcaption>Example of underutilized data dependency. Feature J is an old feature similar to Feature K. In the future, when we remove Feature J, the modules that depend on it will fail silently.</figcaption>
</figure>
</div>
</section>
<section id="static-analysis-of-data-dependencies-data-catalogs" class="level4">
<h4 class="anchored" data-anchor-id="static-analysis-of-data-dependencies-data-catalogs"><u><strong>3. Static Analysis of Data Dependencies</strong></u> <strong>(Data Catalogs!)</strong></h4>
<ol type="1">
<li>In traditional code, compilers and build systems perform static analysis of dependency graphs. <strong>Tools for static analysis of data dependencies</strong> are far less common, but <strong>are essential for error checking, tracking down consumers, and enforcing migration and updates</strong>.</li>
<li><strong>Annotating data sources and features with metadata</strong> (e.g.&nbsp;deprecation, platform-specific availability, and domain-specific applicability), and setting up automatic checks to make sure all annotations are updated <strong>helps to resolve the dependency tree</strong> (users and systems who use the data).</li>
<li>This kind of tooling can make <strong>migration and deletion much safer</strong> in practice.</li>
<li>Google’s solution can be found <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41159.pdf">here</a> (Section 8: “AUTOMATED FEATURE MANAGEMENT”).</li>
</ol>
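<p>The annotation idea above can be sketched as a small catalog plus an automatic check. This is not Google’s system, just a minimal illustration: the <code>FeatureAnnotation</code> fields, the <code>find_violations</code> helper, and all feature and model names are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureAnnotation:
    """Metadata attached to one feature in the catalog."""
    owner: str
    deprecated: bool = False
    consumers: frozenset = frozenset()


def find_violations(catalog, model_features):
    """List problems for every feature that a model reads."""
    problems = []
    for model, features in sorted(model_features.items()):
        for feature in features:
            annotation = catalog.get(feature)
            if annotation is None:
                problems.append(f"{model}: '{feature}' has no annotation")
            elif annotation.deprecated:
                problems.append(f"{model}: '{feature}' is deprecated")
            elif model not in annotation.consumers:
                problems.append(f"{model}: undeclared consumer of '{feature}'")
    return problems


catalog = {
    "new_product_id": FeatureAnnotation(owner="catalog-team", consumers=frozenset({"ranker"})),
    "old_product_id": FeatureAnnotation(
        owner="catalog-team", deprecated=True, consumers=frozenset({"ranker"})
    ),
}
model_features = {
    "ranker": ["new_product_id", "old_product_id"],
    "pricing": ["new_product_id", "clicks_7d"],
}

problems = find_violations(catalog, model_features)
for line in problems:
    print(line)
```

<p>Run as part of CI, a check like this surfaces deprecated features, missing annotations, and undeclared consumers before a migration or deletion breaks something.</p>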
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-19-at-18.40.03.png" class="img-fluid figure-img"></p>
<figcaption>Example of data dependency management.</figcaption>
</figure>
</div>
<hr>
</section>
</section>
<section id="feedback-loops" class="level3">
<h3 class="anchored" data-anchor-id="feedback-loops">Feedback Loops</h3>
<p>When it comes to <strong>live ML systems</strong> that update their behavior over time, it is difficult to predict how they will behave once they are released into production. This is called <strong>analysis debt</strong>: the difficulty of predicting the behavior of a model before deployment.</p>
<section id="direct-feedback-loops-learning-wrong-things" class="level4">
<h4 class="anchored" data-anchor-id="direct-feedback-loops-learning-wrong-things"><strong><u>1. Direct Feedback Loops</u> (Learning wrong things)</strong></h4>
<ol type="1">
<li>A model may directly influence the selection of its own future training data.<br>
<u>Example:</u><br>
In an e-commerce website a model might recommend 10 different products to a new customer. The customer <u>might choose a product</u> category that he/she needs to purchase at the moment, but <u>which is not related to his/her actual interests</u>. The <u>model captures this and retrains itself, learning the wrong interest</u> and becoming confident that the chosen product category conveys the customer’s interest.</li>
<li><strong>Possible Solution #1:</strong><br>
Acquiring customer feedback on the recommendations provided by the system.</li>
<li><strong>Possible Solution #2:</strong><br>
Increasing the model’s “exploration” so that it doesn’t only “exploit” a small number of signals (<a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">Bandit Algorithms</a>).</li>
<li><strong>Possible Solution #3:</strong><br>
Using some amount of <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf">randomization</a>.</li>
</ol>
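<p>To make Possible Solution #2 a bit more concrete, here is a minimal epsilon-greedy sketch. The category names, click rates and the <code>choose_category</code> helper are made up for illustration; a real recommender would estimate the rates from logged feedback.</p>

```python
import random


def choose_category(click_rates, epsilon=0.1, rng=random):
    """Pick a product category to recommend.

    click_rates: dict mapping category -> estimated click rate (hypothetical).
    With probability epsilon we explore a random category; otherwise we
    exploit the category with the highest current estimate.
    """
    categories = list(click_rates)
    if rng.random() < epsilon:
        return rng.choice(categories)
    return max(categories, key=click_rates.get)


rates = {"books": 0.12, "garden": 0.30, "toys": 0.05}
rng = random.Random(0)  # seeded so the run is repeatable
picks = [choose_category(rates, epsilon=0.2, rng=rng) for _ in range(1000)]
share_exploit = picks.count("garden") / len(picks)
```

<p>Most picks still go to the best-looking category, but a fraction of the traffic keeps sampling the others, so a single misleading purchase is less likely to lock the model in.</p>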
</section>
<section id="hidden-feedback-loops-we-were-connected" class="level4">
<h4 class="anchored" data-anchor-id="hidden-feedback-loops-we-were-connected"><strong><u>2. Hidden Feedback Loops</u> (We were connected???)</strong></h4>
<ol type="1">
<li><strong>Hidden Feedback Loops - two systems influence each other indirectly (in a hidden, difficult-to-detect manner).</strong></li>
<li>Difficult to detect!</li>
<li>Improving one system may lead to changes in the behavior of the other.</li>
<li>Hidden feedback loops can also occur in completely disjoint systems.<br>
<u>Example:</u><br>
Consider two investment firms: one firm changes its bidding algorithm (improvements, bugs, etc.), and the other firm’s model picks up this signal and changes its behavior too, which might lead to disaster.</li>
</ol>
<hr>
</section>
</section>
<section id="ml-system-anti-patterns" class="level3">
<h3 class="anchored" data-anchor-id="ml-system-anti-patterns">ML-System Anti-Patterns</h3>
<p><strong>In the real world, ML-related code takes a very small fraction of the entire system.</strong> Therefore, it is <u>important to consider other parts of the system</u> and <u>make sure that there are as few system-design anti-patterns as possible</u>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-20-at-17.05.22.png" class="img-fluid figure-img"></p>
<figcaption>Figure 1 in “Hidden Technical Debt in Machine Learning Systems”, Google.</figcaption>
</figure>
</div>
<section id="glue-code-packages-should-be-replaceable-apis" class="level4">
<h4 class="anchored" data-anchor-id="glue-code-packages-should-be-replaceable-apis"><strong><u>1. Glue Code</u></strong> (Packages should be replaceable –&gt; APIs)</h4>
<ol type="1">
<li><strong>Glue code system design</strong> - a system with a large portion of <u>code written to support the specific requirements of general-purpose packages</u>.</li>
<li>Glue code makes it <strong>difficult to test alternative</strong> methods or make improvements <strong>in the future</strong>.</li>
<li>If <strong>ML code takes 5%</strong> and <strong>glue code takes 95%</strong> of the total code, it makes sense to implement a native solution rather than using general-purpose packages.</li>
<li><strong>Possible Solution:</strong><br>
Wrap general-purpose packages into <strong>common APIs</strong>. This way we can <strong>reuse the APIs</strong> <strong>and</strong> <strong>not worry</strong> <strong>about changing packages</strong> (e.g.&nbsp;<a href="https://arxiv.org/pdf/1309.0238.pdf">scikit-learn’s Estimator API</a> with common fit(), predict() methods for all models).</li>
</ol>
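<p>A rough sketch of the “common API” idea, assuming a hypothetical third-party package (<code>VendorModel</code>) with its own method names. Everything here is illustrative: the point is only that the glue lives in one adapter and the rest of the code depends on <code>fit()</code>/<code>predict()</code>.</p>

```python
from typing import Protocol, Sequence


class Estimator(Protocol):
    """The one interface the rest of the system is allowed to depend on."""

    def fit(self, X: Sequence, y: Sequence) -> "Estimator": ...
    def predict(self, X: Sequence) -> list: ...


class MeanBaseline:
    """A native model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class VendorModel:
    """Stand-in for a general-purpose package with its own method names."""

    def train(self, rows, targets):
        self.last_ = targets[-1]

    def score_rows(self, rows):
        return (self.last_ for _ in rows)


class VendorAdapter:
    """All glue for VendorModel lives here, behind fit()/predict()."""

    def __init__(self, backend):
        self._backend = backend

    def fit(self, X, y):
        self._backend.train(rows=X, targets=y)
        return self

    def predict(self, X):
        return list(self._backend.score_rows(X))


def mean_absolute_error(model: Estimator, X, y):
    """Evaluation code that works with any Estimator."""
    predictions = model.fit(X, y).predict(X)
    return sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


X, y = [[1], [2], [3]], [10.0, 20.0, 30.0]
errors = {
    "baseline": mean_absolute_error(MeanBaseline(), X, y),
    "vendor": mean_absolute_error(VendorAdapter(VendorModel()), X, y),
}
```

<p>Swapping the package later means rewriting one adapter class, not the evaluation code.</p>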
</section>
<section id="pipeline-jungles-keep-things-organized" class="level4">
<h4 class="anchored" data-anchor-id="pipeline-jungles-keep-things-organized"><u><strong>2. Pipeline Jungles</strong></u> <strong>(Keep things organized)</strong></h4>
<ol type="1">
<li><strong>Usually occurs during data preparation.</strong></li>
<li><strong>Adding new data sources</strong> to the data pipeline one by one gradually makes it <strong>difficult to make improvements and detect errors</strong>.</li>
<li><strong>Possible Solution #1:</strong><br>
<strong>Think about data collection and feature extraction as a whole.</strong> Redesign the jungle pipeline from scratch. It might require a lot of effort, but it is an investment worth the price.</li>
<li><strong>Possible Solution #2:</strong><br>
<strong>Make sure researchers and engineers are not overly separated.</strong> Packages written by the researchers may seem like black-box algorithms to the engineers who use them. If possible, <strong>integrate researchers and engineers into the same team</strong> (or create hybrid engineers).</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-20-at-18.42.28.png" class="img-fluid figure-img"></p>
<figcaption>Icons from thenounproject.com. Integrating researchers and engineers into one team.</figcaption>
</figure>
</div>
</section>
<section id="dead-experimental-codepaths-clean-your-room" class="level4">
<h4 class="anchored" data-anchor-id="dead-experimental-codepaths-clean-your-room"><u><strong>3.</strong></u> <strong><u>Dead Experimental Codepaths</u> (Clean your room!)</strong></h4>
<ol type="1">
<li><strong>Adding experimental code to the production system can cause issues in the long term.</strong></li>
<li>It becomes difficult/impossible to test all possible interactions between codepaths. The system becomes overly complex with many experimental branches.</li>
<li>“A famous example of the dangers here was Knight Capital’s system losing $465 million in 45 minutes, because of unexpected behavior from obsolete experimental codepaths.” (<a href="https://www.sec.gov/news/press-release/2013-222">source</a>)</li>
<li><strong>Possible Solution:</strong><br>
“<strong>Periodically reexamine each experimental branch</strong> to see what can be ripped out. Often only a small subset of the possible branches is actually used; many others may have been tested once and abandoned.”</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/Screen-Shot-2021-05-24-at-16.27.09.png" class="img-fluid figure-img"></p>
<figcaption>Example of dead code.</figcaption>
</figure>
</div>
</section>
<section id="abstraction-depth-can-we-abstract-this" class="level4">
<h4 class="anchored" data-anchor-id="abstraction-depth-can-we-abstract-this"><u><strong>4.</strong></u> <strong><u>Abstraction Depth</u> (Can we abstract this?)</strong></h4>
<ol type="1">
<li>The relational database is a successful example of abstraction.</li>
<li>In ML it is difficult to come up with robust abstractions (how do we create abstractions for a data stream, a model and a prediction?).</li>
<li><u>Example abstraction:</u><br>
<strong><a href="https://web.eecs.umich.edu/~mosharaf/Readings/Parameter-Server.pdf">Parameter-server abstraction</a> - framework for distributed ML problems.</strong> “Data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices.”</li>
</ol>
</section>
<section id="common-smells-somethings-fishy" class="level4">
<h4 class="anchored" data-anchor-id="common-smells-somethings-fishy"><u><strong>5.</strong></u> <strong><u>Common Smells</u> (Something’s fishy…)</strong></h4>
<ol type="1">
<li>“<strong>Code Smells</strong> <strong>are patterns of code that suggest there might be a problem</strong>, that there might be a better way of writing the code or that more design perhaps should go into it.” (<a href="https://github.com/lee-dohm/code-smells#:~:text=Code%20Smell%20is%20a%20term,perhaps%20should%20go%20into%20it.">source</a>)</li>
<li><strong>Plain-Old-Data Type Smells.</strong><br>
<strong>Check what flows in and out of ML systems.</strong> The data (signal, features, information, etc.) flowing in and out of ML systems is usually of type integer or float. “In a robust system, a model parameter should know if it is a log-odds multiplier or a decision threshold, and a prediction should know various pieces of information about the model that produced it and how it should be consumed.”</li>
<li><strong>Multiple-Language Smell.</strong><br>
Using multiple programming languages, no matter how great the packages written in each, makes it difficult to test efficiently and, later, to transfer ownership to others.</li>
<li><strong>Prototype Smell.</strong><br>
Constantly using prototype environments is an indication that the production system is “<u>brittle, difficult to change, or could benefit from improved abstractions and interfaces</u>”.<br>
<u>This leads to 2 potential problems:</u><br>

<ul>
<li>Usage of the prototype environment as the production environment due to time pressures.<br>
</li>
<li>Prototype (small scale) solutions rarely reflect the reality of full-scale systems.</li>
</ul></li>
</ol>
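<p>One way to picture the fix for the Plain-Old-Data Type Smell is to wrap raw floats in small types that carry their meaning. This is only a sketch; the class names and the <code>model_version</code> field are my own assumptions, not something from the paper.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionThreshold:
    """A float that knows it is a probability cut-off."""

    value: float

    def __post_init__(self):
        if not 0.0 <= self.value <= 1.0:
            raise ValueError("threshold must be a probability in [0, 1]")


@dataclass(frozen=True)
class Prediction:
    """A score that knows which model produced it and how to consume it."""

    probability: float
    model_version: str  # hypothetical metadata field

    def is_positive(self, threshold: DecisionThreshold) -> bool:
        return self.probability >= threshold.value


pred = Prediction(probability=0.83, model_version="ranker-v7")
decision = pred.is_positive(DecisionThreshold(0.5))
```

<p>With plain floats, nothing stops someone from passing a log-odds value where a threshold is expected; here the mistake surfaces as a type mismatch or a <code>ValueError</code>.</p>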
<hr>
</section>
</section>
<section id="configuration-debt" class="level3">
<h3 class="anchored" data-anchor-id="configuration-debt">Configuration Debt</h3>
<p><strong>ML systems come with various configuration options:</strong><br>
1. &nbsp;Choice of input features<br>
2. &nbsp;Algorithm-specific learning configurations (hyperparameters, number of nodes, layers, etc.)<br>
3. &nbsp;Pre- &amp; post-processing methods<br>
4. &nbsp;Verification methods</p>
<p>In full-scale systems, the number of lines of configuration can far exceed the number of lines of traditional code. Mistakes in configuration can lead to loss of time, waste of computing resources, and other production issues. Thus, <strong>verifying and testing configurations is crucial</strong>.</p>
<p><strong>Example of difficult-to-handle configurations (Choice of input features):</strong><br>
1. &nbsp;Feature A was <u>incorrectly logged</u> from 9/13-9/17.<br>
2. &nbsp;Feature B <u>isn’t available</u> before 10/7.<br>
3. &nbsp;Feature C has to have 2 <u>different acquisition methods</u> due to changes in logging.<br>
4. &nbsp;Feature D <u>isn’t available in production</u>. Substitute feature D’ must be used.<br>
5. &nbsp;Feature Z <u>requires extra training memory</u> for the model to train efficiently.<br>
6. &nbsp;Feature Q has <u>high latency</u>, so it makes Feature R also unusable.</p>
<p><u><strong>Configurations must be:</strong></u><br>
1. &nbsp;<strong>Easy to change.</strong> It should feel like making a small change to the previous configuration.<br>
2. &nbsp;<strong>Difficult to make manual errors, omissions, or oversights</strong>.<br>
3. &nbsp;<strong>Easy to analyze and compare with the previous version.</strong><br>
4. &nbsp;<strong>Easy to check</strong> (basic facts and details)<strong>.</strong><br>
5. &nbsp;<strong>Easy to detect unused or redundant settings.</strong><br>
6. &nbsp;<strong>Code reviewed and maintained in a repository.</strong></p>
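<p>As a small illustration of the checklist above, here is a sketch of a configuration object that validates itself and can be diffed against the previous version. The field names and the <code>KNOWN_FEATURES</code> set are hypothetical.</p>

```python
from dataclasses import asdict, dataclass, replace

KNOWN_FEATURES = {"feature_a", "feature_b", "feature_d_prime"}  # hypothetical


@dataclass(frozen=True)
class TrainConfig:
    features: tuple
    learning_rate: float = 0.01
    num_layers: int = 2

    def validate(self):
        """Return a list of problems; an empty list means the config is OK."""
        errors = []
        unknown = set(self.features) - KNOWN_FEATURES
        if unknown:
            errors.append(f"unknown features: {sorted(unknown)}")
        if not 0 < self.learning_rate < 1:
            errors.append("learning_rate must be in (0, 1)")
        if self.num_layers < 1:
            errors.append("num_layers must be >= 1")
        return errors


def diff_configs(old, new):
    """Map each changed field to its (old, new) pair."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k]) for k in old_d if old_d[k] != new_d[k]}


base = TrainConfig(features=("feature_a", "feature_b"))
candidate = replace(base, learning_rate=0.05)  # a small change to the previous config
broken = replace(base, features=("feature_a", "feature_d"))

changes = diff_configs(base, candidate)
problems = broken.validate()
```

<p>Keeping such objects in the repository means a configuration change shows up in code review as a one-line diff.</p>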
<hr>
</section>
<section id="dealing-with-the-changes-in-the-external-world" class="level3">
<h3 class="anchored" data-anchor-id="dealing-with-the-changes-in-the-external-world">Dealing with the Changes in the External World</h3>
<p><strong>ML systems often directly interact with the <u>ever-changing</u> real world.</strong> Due to volatility in the real world, ML systems require <u>continuous maintenance</u>.</p>
<section id="fixed-thresholds-in-dynamic-systems" class="level4">
<h4 class="anchored" data-anchor-id="fixed-thresholds-in-dynamic-systems"><strong><u>1. Fixed Thresholds in Dynamic Systems</u></strong></h4>
<ol type="1">
<li>Binary classification problems (predicting whether a given sample is positive or not) require setting a <strong>decision threshold</strong>.</li>
<li>The classic approach is to choose a decision threshold that maximizes a certain metric (precision, recall, etc.).</li>
<li><strong>When a model is retrained on new data, the previous decision threshold usually becomes invalid.</strong></li>
<li>Manually calculating thresholds is time-consuming, so an automatic optimization method is required.</li>
<li>One mitigation is to set aside a held-out validation set and use it to calculate the thresholds automatically.</li>
<li><strong>The approach used by Google data scientists can be found in the <a href="https://www.unofficialgoogledatascience.com/2021/04/why-model-calibration-matters-and-how.html">“Unofficial Google Data Science” blog post</a>.</strong></li>
<li>My simple approach to dealing with decision thresholds - <a href="../../not-so-easy-poc/">blog post</a>.</li>
</ol>
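<p>The held-out validation idea from the list above can be sketched in a few lines. This version picks the threshold that maximizes F1 on a toy validation set; the metric choice and the data are assumptions for illustration, and in practice you would rerun this after every retraining.</p>

```python
def f1_at(threshold, scores, labels):
    """F1 score when every score >= threshold is predicted positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a candidate threshold and keep the best."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Toy held-out validation set: model scores and true labels.
val_scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.90]
val_labels = [0, 0, 1, 0, 1, 1]
threshold = best_threshold(val_scores, val_labels)
```

<p>Because the threshold is recomputed from data, it follows the model when the score distribution shifts instead of silently going stale.</p>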
<hr>
</section>
<section id="monitoring-and-testing" class="level4">
<h4 class="anchored" data-anchor-id="monitoring-and-testing"><strong><u>2. Monitoring and Testing</u></strong></h4>
<ol type="1">
<li>Unit tests and end-to-end tests of systems are valuable. However, in the real world, these tests are simply not enough.</li>
<li><strong><u>Realtime monitoring with automated response</u> to issues is essential for long-term system reliability.</strong></li>
<li><strong>What should be monitored:</strong><br>
<u><strong>a) Prediction Bias</strong></u><br>
<strong>Is the distribution of predicted and observed labels equal?</strong> Although this test is not enough to detect a model that simply outputs average values of label occurrences without regard to the input features, it is surprisingly effective.<br>
<u>For example:</u><br>

<ul>
<li>Real-world behavior changes and, now, the training data distributions are not reflective of reality. <strong>In this case, analyzing predictions for various dimensions (e.g.&nbsp;based on race, gender, age, etc.) can isolate biases.</strong> Further, we can set up automated alerting in case we detect prediction bias.<br>
<br>
<strong><u>b) Action Limits</u></strong><br>
<strong>In real-world ML systems that take actions such as bidding on items, classifying messages as spam, it is useful to set <u>action limits</u> as a sanity check.</strong> “If the system hits a limit for a given action, automated alerts should fire and trigger manual intervention or investigation.”<br>
<u>For example:</u><br>
</li>
<li>Setting a maximum number of bids per hour<br>
</li>
<li>Setting the maximum ratio of spam to regular emails to 3/5<br>
<br>
<strong><u>c) Up-Stream Producers</u></strong><br>
When data comes from up-stream producers, the <strong>up-stream producers need to be monitored, tested, and routinely meet objectives</strong> that take into account the downstream ML system. Most importantly, any issues in the up-stream must be propagated to the downstream ML system. And the ML system must also notify its downstream consumers.<br>
</li>
</ul></li>
<li><strong>Issues that occur in real-time</strong> and have an impact on the system <strong>should be dealt with by automatic measures.</strong> Human intervention can work, but <u>if the issue is time-sensitive it won’t work</u>. Automatic measures are worth the investment.</li>
</ol>
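<p>Two of the monitors above, prediction bias and action limits, are simple enough to sketch. The tolerance, the limit and the helper names are placeholders I picked for the example; a real system would wire the results into its alerting.</p>

```python
def prediction_bias(predicted_probs, observed_labels):
    """Mean predicted positive rate minus mean observed positive rate."""
    mean_predicted = sum(predicted_probs) / len(predicted_probs)
    mean_observed = sum(observed_labels) / len(observed_labels)
    return mean_predicted - mean_observed


def check_bias(predicted_probs, observed_labels, tolerance=0.05):
    """Return (ok, bias); ok is False when an alert should fire."""
    bias = prediction_bias(predicted_probs, observed_labels)
    return abs(bias) <= tolerance, bias


class ActionLimiter:
    """Sanity check: refuse further actions once a limit is reached."""

    def __init__(self, max_actions):
        self.max_actions = max_actions
        self.count = 0

    def allow(self):
        if self.count >= self.max_actions:
            return False  # this is where an alert would fire
        self.count += 1
        return True


ok, bias = check_bias([0.9, 0.8, 0.7, 0.6], [1, 0, 0, 0])

bids = ActionLimiter(max_actions=2)  # e.g. maximum bids per hour
allowed = [bids.allow() for _ in range(3)]
```

<p>Running <code>check_bias</code> per slice (for example per age group) is a direct way to do the per-dimension analysis mentioned above.</p>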
<hr>
</section>
</section>
<section id="other-areas-of-ml-related-debt" class="level3">
<h3 class="anchored" data-anchor-id="other-areas-of-ml-related-debt">Other Areas of ML-related Debt</h3>
<section id="data-testing-debt" class="level4">
<h4 class="anchored" data-anchor-id="data-testing-debt"><strong><u>1. Data Testing Debt</u></strong></h4>
<p><strong>Compared to traditional systems, ML systems replace code with data.</strong> And if the input data is the core component, it makes sense to test the data. Basic sanity checks, as well as complex distribution checks are useful.</p>
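<p>Basic sanity checks on input data do not need much machinery. Below is a minimal sketch, assuming rows arrive as dictionaries; the column names, ranges and the 10% missing-value budget are invented for the example.</p>

```python
def check_rows(rows, required, ranges, max_null_fraction=0.1):
    """Return a list of human-readable problems found in rows (list of dicts)."""
    problems = []
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls / len(rows) > max_null_fraction:
            problems.append(f"{column}: {nulls}/{len(rows)} values missing")
    for column, (low, high) in ranges.items():
        bad = [
            row[column]
            for row in rows
            if row.get(column) is not None and not low <= row[column] <= high
        ]
        if bad:
            problems.append(f"{column}: {len(bad)} values outside [{low}, {high}]")
    return problems


rows = [
    {"age": 34, "ctr": 0.02},
    {"age": None, "ctr": 0.50},
    {"age": 210, "ctr": 1.70},
    {"age": 41, "ctr": 0.07},
]
problems = check_rows(
    rows,
    required=["age", "ctr"],
    ranges={"age": (0, 120), "ctr": (0.0, 1.0)},
)
```

<p>Distribution checks (for example comparing today’s histogram with last week’s) follow the same pattern, just with a statistical test in place of the range check.</p>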
</section>
<section id="reproducibility-debt" class="level4">
<h4 class="anchored" data-anchor-id="reproducibility-debt"><strong><u>2. Reproducibility Debt</u></strong></h4>
<p><strong>Reproducibility of experiments is an important principle of the scientific method.</strong> However, randomized algorithms, parallel learning and complex real-world interactions make it difficult to reproduce results.</p>
</section>
<section id="process-management-debt" class="level4">
<h4 class="anchored" data-anchor-id="process-management-debt"><strong><u>3. Process Management Debt</u></strong></h4>
<p><strong>Mature systems may have hundreds of models running in parallel.</strong> <u>Updating configurations of many similar models safely and automatically, managing and assigning resources among models with different business properties, visualizing and detecting issues in the flow of data, and being able to recover the system from production issues are important features of a reliable ML system.</u></p>
</section>
<section id="cultural-debt" class="level4">
<h4 class="anchored" data-anchor-id="cultural-debt"><strong><u>4. Cultural Debt</u></strong></h4>
<p><strong>To keep an ML system healthy, create teams with diverse strengths in ML research and engineering.</strong><br>
Team culture should reward the following to the same degree as improvements in accuracy:<br>
- Deletion of features<br>
- Reduction of complexity<br>
- Improvements in reproducibility<br>
- Improvements in stability<br>
- Improvements in monitoring</p>
<hr>
</section>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion"><strong>Conclusion</strong></h3>
<ol type="1">
<li><strong>Speed is not evidence of low technical debt or good practices</strong> because the <strong><u>real cost of debt becomes apparent over time</u></strong>.</li>
<li><strong>Useful questions for detecting ML technical debt:</strong><br>

<ul>
<li>How easily can an entirely new algorithmic approach be tested at full scale?<br>
</li>
<li>What is the transitive closure of all data dependencies? <u>Transitive closure</u> gives you the set of all places you can get to, from any starting place.<br>
</li>
<li>How precisely can the impact of a new change to the system be measured?<br>
</li>
<li>Does improving one model or signal degrade others?<br>
</li>
<li>How quickly can new members of the team be brought up to speed?</li>
</ul></li>
<li><strong>Key areas that will help reduce technical debt:</strong><br>

<ul>
<li>Maintainable ML<br>
</li>
<li>Better abstractions<br>
</li>
<li>Testing methodologies<br>
</li>
<li>Design patterns</li>
</ul></li>
<li><strong>Engineers and researchers both need to be aware of technical debt.</strong> “Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice.”</li>
<li><strong>Dealing with technical debt can only be achieved by a change in team culture.</strong> “Recognizing, prioritizing, and rewarding this effort is important for the long term health of successful ML teams.”</li>
</ol>
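<p>The “transitive closure of all data dependencies” question can be answered mechanically once the dependencies are written down. Here is a small sketch with a made-up dependency graph, where each key lists the inputs it reads from.</p>

```python
def transitive_closure(deps, start):
    """Return every upstream dependency reachable from start."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical graph: consumer -> list of direct inputs.
deps = {
    "ctr_model": ["user_features", "ad_features"],
    "user_features": ["click_log"],
    "ad_features": ["ad_catalog", "click_log"],
}
upstream = transitive_closure(deps, "ctr_model")
```

<p>Here <code>upstream</code> contains both feature tables plus <code>click_log</code> and <code>ad_catalog</code>, i.e. everything whose change could affect the model.</p>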
<hr>
<p>And this is it!</p>
<p>In this post, I have tried to summarize the key points of the prominent paper <a href="https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf">“Hidden Technical Debt in Machine Learning Systems”</a> by Google.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/05/avery-evans-NOm4f0xx2bU-unsplash.jpg" class="img-fluid figure-img"></p>
<figcaption>“The Thinker” - by Avery Evans, unsplash.com</figcaption>
</figure>
</div>


</section>

 ]]></description>
  <category>mlops</category>
  <guid>https://bilguunbatsaikhan.com/posts/deep-dive-into-mlops/</guid>
  <pubDate>Thu, 13 May 2021 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Data engineering: simple and complex data pipelines</title>
  <link>https://bilguunbatsaikhan.com/posts/data-engineering/</link>
  <description><![CDATA[ 




<p>Most of my previous work consisted of various data analysis and ML-related tasks.</p>
<p>Recently, I have been working on tasks related to data engineering, so I decided to learn more about it. I stumbled upon Chris Riccomini’s talk at <a href="https://qconsf.com/">QCon San Francisco</a> and learned quite a few terms and concepts.</p>
<p>In this blog I would like to summarize the key points from Chris’ talk.<br>
All credits go to Chris Riccomini (<a href="https://www.youtube.com/watch?v=ZZr9oE4Oa5U">Link to the talk</a>, <a href="https://www.linkedin.com/in/riccomini/">LinkedIn</a>, <a href="https://twitter.com/criccomini?s=20">Twitter</a>).</p>
<hr>
<p><strong>What is the role of a data engineer?</strong><br>
- A data engineer’s job is to help an organization <strong>move</strong> (streaming or data pipelines) and <strong>process</strong> (data warehouses (DWH)) data.</p>
<blockquote class="blockquote">
<p>Data engineers build tools, infrastructure, frameworks and services.</p>
</blockquote>
<blockquote class="blockquote">
<p>Data engineering is much closer to software engineering than it is to data science.</p>
</blockquote>
<p>- <a href="https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/">“The rise of the data engineer”</a>, Maxime Beauchemin, Preset.</p>
<hr>
<p>Data engineers are primarily involved with building data pipelines.</p>
<p><strong>There are 6 stages in an organization’s data pipeline:</strong><br>
1. None<br>
2. Batch<br>
3. Realtime<br>
4. Integration<br>
5. Automation<br>
6. Decentralization</p>
<hr>
<p><strong><u>1. None (Monolith DB)</u></strong></p>
<p>Structure:<br>
1. Single large DB<br>
2. Users access the same DB<br>
<br>
<strong>Pros/cons:</strong></p>
<ol type="1">
<li><u>Pros:</u><br>

<ul>
<li>Simple</li>
</ul></li>
<li><u>Cons:</u><br>

<ul>
<li>Queries time out<br>
</li>
<li>Users impact each other<br>
</li>
<li>MySQL doesn’t have complex SQL functions<br>
</li>
<li>Report generation breaks</li>
</ul></li>
</ol>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-12.00.18.png" class="img-fluid"></p>
<hr>
<p><u><strong>2. Batch (DWH + Scheduler)</strong></u></p>
<p>Structure:<br>
1. In-between the user and the DB we put a DWH<br>
2. To get data from the DB to the DWH, we put a scheduler to periodically suck the data in.</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-12.01.00.png" class="img-fluid"></p>
<p><strong>Pros/cons:</strong></p>
<ol type="1">
<li><u>Pros:</u><br>

<ul>
<li>Setup is quick<br>
</li>
<li>Best for a basic setup</li>
</ul></li>
<li><u>Cons:</u><br>

<ul>
<li>A large number of Airflow jobs is difficult to maintain<br>
</li>
<li>create_time, modify_time issues arise<br>
</li>
<li>DB Admin’s operations impact the pipeline<br>
</li>
<li>Hard deletes don’t propagate<br>
</li>
<li>MySQL replication latency (the amount of time it takes for a transaction that occurs in the primary database to be applied to the replica database) causes data quality issues<br>
</li>
<li>Periodic loads cause occasional MySQL timeouts</li>
</ul></li>
</ol>
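<p>To see why “hard deletes don’t propagate” in a batch setup, here is a toy version of a scheduled incremental load using SQLite. The <code>users</code> table and the integer <code>modify_time</code> column are assumptions for the sketch; in a real setup this would be an Airflow job pulling from MySQL into the DWH.</p>

```python
import sqlite3


def incremental_load(source, warehouse, last_sync):
    """Copy rows modified after last_sync; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, name, modify_time FROM users WHERE modify_time > ?",
        (last_sync,),
    ).fetchall()
    warehouse.executemany(
        "INSERT OR REPLACE INTO users (id, name, modify_time) VALUES (?, ?, ?)",
        rows,
    )
    return max((row[2] for row in rows), default=last_sync)


source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, modify_time INTEGER)"
    )

# First scheduled run: both rows are copied.
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?)", [(1, "ann", 100), (2, "bob", 110)]
)
last_sync = incremental_load(source, warehouse, last_sync=0)

# Between runs: one update and one hard delete in the source DB.
source.execute("UPDATE users SET name = 'bobby', modify_time = 120 WHERE id = 2")
source.execute("DELETE FROM users WHERE id = 1")
last_sync = incremental_load(source, warehouse, last_sync)

warehouse_rows = warehouse.execute("SELECT id, name FROM users ORDER BY id").fetchall()
```

<p>The update reaches the warehouse, but the deleted row with <code>id = 1</code> is still there: a query on <code>modify_time</code> can never see a row that no longer exists.</p>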
<hr>
<p>Transition to <strong><u>Realtime</u></strong> if:<br>
1. Loads are taking too long<br>
2. Pipelines are no longer stable<br>
3. Many complicated workflows<br>
4. Data latency (the time it takes for data to travel from one place to another) is becoming an issue<br>
5. Data engineering is your full-time job<br>
6. Your organization uses Apache Kafka (a <strong>stream processing tool</strong> that provides a <strong>unified</strong>, <strong>high-throughput</strong>, <strong>low-latency</strong> platform for handling <strong>real-time</strong> data feeds)</p>
<hr>
<p><strong><u>3. Realtime (Kafka)</u></strong></p>
<p>Structure:<br>
1. Replace Airflow with Debezium (a tool for change data capture: start it up, point it at your data sources, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases).</p>
<blockquote class="blockquote">
<p>Change Data Capture is github for DB changes.</p>
</blockquote>
<blockquote class="blockquote">
<p>Debezium data sources: MongoDB, MySQL, PostgreSQL, SQL Server, Oracle, Cassandra</p>
</blockquote>
<p>2. Operational complexity has gone up</p>
<p>3. KCBQ (Kafka Connect BigQuery) - a connector that takes data from Kafka and uploads it into BigQuery</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-12.19.23.png" class="img-fluid"></p>
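<p>To give a feel for what change data capture delivers downstream, here is a sketch that replays a stream of change events into an in-memory replica. The event shape is a simplified assumption, loosely modelled on the <code>op</code>/<code>before</code>/<code>after</code> envelope that Debezium emits; real events carry more fields (source metadata, timestamps, schema).</p>

```python
def apply_change(replica, event):
    """Apply one change event to a replica dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u"):  # create or update: take the new row image
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":  # delete: remove the row using the old row image
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Simplified, hand-written events (not real connector output).
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "ann"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}},
    {"op": "u", "before": {"id": 2, "name": "bob"}, "after": {"id": 2, "name": "bobby"}},
    {"op": "d", "before": {"id": 1, "name": "ann"}, "after": None},
]

replica = {}
for event in events:
    apply_change(replica, event)
```

<p>Unlike a periodic load keyed on a timestamp column, the delete arrives as an event of its own, so the replica ends up with only the updated row for <code>id = 2</code>.</p>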
<hr>
<p>Transition to <strong><u>Integration</u></strong> if:<br>
1. You have many microservices<br>
2. You have a diverse DB ecosystem<br>
3. You have a team of data engineers<br>
4. You have a mature SRE organization (SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks)</p>
<hr>
<p><strong><u>4. Integration (Advanced topic)</u></strong></p>
<p>Structure:<br>
1. Services with DB<br>
2. Streaming platform (Kafka) and DWH<br>
3. Different types of DBs (NoSQL, NewSQL, GraphDB…)</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-13.27.24.png" class="img-fluid"><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-13.28.41.png" class="img-fluid"></p>
<hr>
<p><strong><u>Why do we need such a complex pipeline?</u></strong><br>
<br>
<strong><a href="https://en.wikipedia.org/wiki/Metcalfe%27s_law#:~:text=Metcalfe's%20law%20states%20that%20the,the%20system%20(n2).&amp;text=Only%20later%20with%20the%20globalization,was%20to%20describe%20Ethernet%20connections.">Metcalfe’s law</a></strong> states that the value of a telecommunications network is proportional to the square of the number of connected users of the system (in other words, a network becomes more valuable as you add nodes and edges to it).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/metcalfe.png" class="img-fluid figure-img"></p>
<figcaption>Two telephones can make only one connection, five can make 10 connections, and twelve can make 66 connections.</figcaption>
</figure>
</div>
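<p>The numbers in the caption come from counting the distinct pairs among n nodes, n(n-1)/2, which grows roughly with the square of n. A quick check:</p>

```python
def max_connections(n):
    """Number of distinct pairwise connections among n nodes."""
    return n * (n - 1) // 2


connections = {n: max_connections(n) for n in (2, 5, 12)}
```

<p>The same arithmetic applies to a data pipeline: every system you plug into the streaming platform can, in principle, exchange data with every other one.</p>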
<p><strong>Pros/cons:</strong></p>
<ol type="1">
<li><u>Pros:</u><br>

<ul>
<li>Testing a new realtime system becomes relatively easy because your data is very portable.<br>
</li>
<li>Easy to switch Cloud vendors<br>
</li>
<li>Improves infrastructure agility. Easy to plug-in a new system, feed with data and test it.</li>
</ul></li>
<li><u>Cons:</u><br>

<ul>
<li>Add/create/configure/grant/deploy manual work persists!<br>
</li>
<li>Manual work eats up time…</li>
</ul></li>
</ol>
<hr>
<p>Transition to <strong><u>Automation</u></strong> if:<br>
1. Your SREs can’t keep up<br>
2. Manual work is taking a lot of time</p>
<hr>
<p><strong><u>5. Automation</u></strong></p>
<p>Structure:<br>
1. Automated Data Management added<br>
2. Automated Operations added</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-14.08.43.png" class="img-fluid"></p>
<hr>
<p><strong>a) Automated Data Management</strong></p>
<p>Automation helps with data management:<br>
1. Who gets access to the data once it is loaded?<br>
2. How long can the data exist (e.g.&nbsp;persist for only 3 years, then be removed)?<br>
3. Is this data allowed in this system (sensitive information)?<br>
4. Which geographies must the data persist in?<br>
5. Should columns be masked, redacted?</p>
<p>One of the most repetitive tasks of data management is creating <a href="https://wiki.gccollab.ca/index.php?title=Data_Catalog&amp;mobileaction=toggle_view_desktop">data catalogs</a>.</p>
<p><strong>Contents of a data catalog:</strong><br>
- Location of the data<br>
- Data schema information<br>
- Who owns the data<br>
- Lineage (where the data came from)<br>
- Encryption information (which parts of the data are masked, encrypted)<br>
- Versioning information</p>
<p>Example data catalog by <a href="https://www.amundsen.io/">Lyft’s Amundsen</a> tool:</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/amundsen.png" class="img-fluid"></p>
<p>The key point here is that we don’t want to manually input data into the data catalog.</p>
<p>Instead, we should be hooking up our systems to different data catalog generators since they can automatically generate the metadata (schema, ownership, evolution, etc.).</p>
<hr>
<p><strong>b) Automated Operations:</strong></p>
<p>User management automations:<br>
1. New user access<br>
2. New data access<br>
3. Service account access<br>
4. Temporary access<br>
5. Unused access</p>
<p>Detecting violations via automations:<br>
1. Auditing<br>
2. Data loss prevention (GCP Data Loss Prevention (DLP))</p>
<p>For example, we can run DLP checks to detect whether sensitive information (phone number, SSN, email, etc.) exists in the data or not. This protects us from violating regulations.</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-14.58.35.png" class="img-fluid"></p>
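<p>As a rough picture of what such a check does, here is a toy scanner built on regular expressions. It does not use the GCP DLP API; the patterns are deliberately simplistic assumptions and would miss many real-world formats.</p>

```python
import re

# Illustrative patterns only; a managed DLP service covers far more cases.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_record(record):
    """Return {column: [kinds]} for string columns that match a pattern."""
    findings = {}
    for column, value in record.items():
        if not isinstance(value, str):
            continue
        kinds = [kind for kind, pattern in PATTERNS.items() if pattern.search(value)]
        if kinds:
            findings[column] = kinds
    return findings


record = {
    "id": 7,
    "note": "call 415-555-0100 or mail ann@example.com",
    "city": "Tokyo",
}
findings = scan_record(record)
```

<p>A pipeline could run a check like this before loading a table and block the load, or mask the flagged columns, when something is found.</p>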
<p><u>Even after automating all of the above, data engineers still have to configure and deploy.</u></p>
<hr>
<p>Transition to <strong><u>Decentralization</u></strong> if:<br>
1. You have a fully automated realtime data pipeline<br>
2. People still ask the data engineers to load some data</p>
<hr>
<p><strong><u>6. Decentralization</u></strong></p>
<p>Structure:<br>
1. Multiple DWHs<br>
2. Different groups administer and manage their own DWH<br>
3. From monolith to <strong>micro-warehouses</strong></p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-15.08.36.png" class="img-fluid figure-img"></p>
<figcaption>-</figcaption>
</figure>
</div>
<hr>
<p><strong>What does a fully decentralized data pipeline look like?</strong><br>
1. Polished tools are exposed to everyone<br>
2. Security and compliance manage the access and policies<br>
3. Data engineers manage data tools and infrastructure<br>
4. Everyone manages data pipelines and DWHs</p>
<hr>
<p><strong><u>Conclusion:</u></strong></p>
<p>Modern Data Pipeline structures by Chris Riccomini:<br>
1. Realtime data integration<br>
2. Streaming platforms<br>
3. Automated data management<br>
4. Automated operations<br>
5. Decentralized DWHs and pipelines</p>
<p><img src="https://bilguunbatsaikhan.com/images/2021/04/Screen-Shot-2021-04-28-at-15.16.55.png" class="img-fluid"></p>
<hr>
<p>As a final remark, I would like to say that I truly enjoyed Chris’ speech.</p>
<p>It made me truly appreciate all the hard work put in by data engineers, not to mention the complexity of all the bits and pieces.</p>
<p>For anyone who is reading this post, I highly recommend watching Chris’ talk.</p>
<p>Cheers and stay safe!</p>



 ]]></description>
  <category>data-engineering</category>
  <guid>https://bilguunbatsaikhan.com/posts/data-engineering/</guid>
  <pubDate>Mon, 12 Apr 2021 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Takeaways from Kaggle’s “Jane Street Market Prediction” competition</title>
  <link>https://bilguunbatsaikhan.com/posts/takeaways-from-kaggle-jane-street-market-prediction-competition/</link>
  <description><![CDATA[ 




<p>Recently, I have spent my evenings participating in Kaggle’s “Jane Street Market Prediction” competition.</p>
<p>To preserve the know-how acquired from the competition, I have written this blog post.</p>
<p>If you would like to learn more about the competition click <a href="https://www.kaggle.com/c/jane-street-market-prediction/overview">here</a>.</p>
<section id="contents" class="level3">
<h3 class="anchored" data-anchor-id="contents">Contents</h3>
<ol type="1">
<li>Anonymized data set know-how</li>
<li>Cross-validation know-how</li>
<li>Neural network optimization with Keras-tuner</li>
<li>Keras fast inference know-how</li>
</ol>
</section>
<section id="anonymized-data-set-know-how" class="level3">
<h3 class="anchored" data-anchor-id="anonymized-data-set-know-how">Anonymized data set know-how</h3>
<p>Since we have been provided with an anonymized data set, it was difficult to perform EDA.</p>
<p>The specific data engineering methodology of Jane Street is one of their primary assets and that is why they have anonymized their data set. Hence, I didn’t spend much time on de-anonymization, since I considered it unethical.</p>
<p>However, for those who are interested in how to start de-anonymizing a data set, here are example de-anonymization notebooks shared by <a href="https://www.kaggle.com/gregorycalvez">Gregory Calvez</a> during the competition:</p>
<ol type="1">
<li><a href="https://www.kaggle.com/gregorycalvez/de-anonymization-buy-sell-net-gross" class="uri">https://www.kaggle.com/gregorycalvez/de-anonymization-buy-sell-net-gross</a></li>
<li><a href="https://www.kaggle.com/gregorycalvez/de-anonymization-price-quantity-stocks" class="uri">https://www.kaggle.com/gregorycalvez/de-anonymization-price-quantity-stocks</a></li>
<li><a href="https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags" class="uri">https://www.kaggle.com/gregorycalvez/de-anonymization-time-aggregation-tags</a></li>
<li><a href="https://www.kaggle.com/gregorycalvez/de-anonymization-min-max-and-time" class="uri">https://www.kaggle.com/gregorycalvez/de-anonymization-min-max-and-time</a></li>
</ol>
<p>The author explains his reasoning throughout the notebooks in clear and concise comments.</p>
<hr>
</section>
<section id="cross-validation-know-how" class="level3">
<h3 class="anchored" data-anchor-id="cross-validation-know-how">Cross-validation know-how</h3>
<p>We have been given 2 years of time-series data.</p>
<p>To avoid information leakage, we couldn’t randomly split the data into cross-validation splits.</p>
<p>The main cross-validation methodology used in the competition was referred to as “Purged Group Time Series Split”. Here is how we split the data:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://bilguunbatsaikhan.com/images/2021/03/expanding-window-ts.png" class="img-fluid figure-img"></p>
<figcaption>Credits to https://eng.uber.com/omphalos/</figcaption>
</figure>
</div>
<p>The intuition is that we iteratively expand the training data size over the entire history of a time series and repeatedly test against a forecasting window, without dropping older data points (Learn more about the method <a href="https://eng.uber.com/omphalos/">here</a>).</p>
<p>Here is my simplified implementation of the “Purged Group Time Series Split” (an expanding-window split by date, without a purge gap):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> custom_cv_split(df, date_column_name, n_splits<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>):</span>
<span id="cb1-5">    date_splits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array_split(df[date_column_name].unique(), n_splits)</span>
<span id="cb1-6">    train_ixs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-7">    test_ixs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> split <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> date_splits[:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]:</span>
<span id="cb1-9">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(train_ixs) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb1-10">            curr_tr_ix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train_ixs[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].copy()</span>
<span id="cb1-11">            curr_tr_ix.extend(df[df[date_column_name].isin(split)].index.values.tolist())</span>
<span id="cb1-12">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb1-13">            curr_tr_ix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[df[date_column_name].isin(split)].index.values.tolist()</span>
<span id="cb1-14">        train_ixs.append(curr_tr_ix)</span>
<span id="cb1-15">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> split <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> date_splits[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>:]:</span>
<span id="cb1-16">        curr_ts_ix <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[df[date_column_name].isin(split)].index.values.tolist()</span>
<span id="cb1-17">        test_ixs.append(curr_ts_ix)</span>
<span id="cb1-18">        </span>
<span id="cb1-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> train_ixs:</span>
<span id="cb1-20">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Training size: "</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(i), i[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>], i[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:])</span>
<span id="cb1-21">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>()    </span>
<span id="cb1-22">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> test_ixs:</span>
<span id="cb1-23">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Test size: "</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(i), i[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>], i[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>:])</span>
<span id="cb1-24">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(train_ixs, test_ixs))</span></code></pre></div></div>
<p>If you run into any bugs, have fun debugging them. I am sure it will help you understand the code!</p>
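<p>The function above expands the training window but leaves no gap between the training and test dates. One common reading of “purged” is to drop the last few training dates before each test block, so that information from overlapping windows cannot leak across the boundary. The sketch below illustrates that idea on unique dates only; the function name <code>purged_expanding_splits</code> and the <code>gap</code> parameter are my own assumptions for illustration, not part of any library or of the competition code.</p>

```python
import numpy as np

def purged_expanding_splits(dates, n_splits=4, gap=0):
    """Return (train_dates, test_dates) pairs over the sorted unique dates.

    Fold i trains on date blocks 0..i-1 and tests on block i. The last `gap`
    dates of each training window are dropped (purged).
    """
    unique_dates = np.sort(np.unique(dates))
    blocks = np.array_split(unique_dates, n_splits)
    folds = []
    for i in range(1, n_splits):
        train = np.concatenate(blocks[:i])
        if gap:
            train = train[:-gap]  # purge the dates closest to the test block
        folds.append((train, blocks[i]))
    return folds

# Toy data: 20 trading days with 3 rows per day (hypothetical values).
dates = np.repeat(np.arange(20), 3)
for train, test in purged_expanding_splits(dates, n_splits=4, gap=2):
    print(len(train), "train days, last =", train[-1],
          "| test days", test[0], "to", test[-1])
```

<p>To turn the date pairs into row indices, you can filter the data frame with <code>isin</code>, the same way <code>custom_cv_split</code> does.</p>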
<hr>
</section>
<section id="neural-network-optimization-with-keras-tuner" class="level3">
<h3 class="anchored" data-anchor-id="neural-network-optimization-with-keras-tuner">Neural network optimization with Keras-tuner</h3>
<p>The Keras-tuner package has made it easy and intuitive to tune neural network parameters! Its <a href="https://www.tensorflow.org/tutorials/keras/keras_tuner">documentation</a> is also quite simple.</p>
<p>However, when it comes to using custom cross-validation sets, we have to apply a small trick: subclassing the keras-tuner “kt.engine.tuner.Tuner” class and overriding its “run_trial” method (the code below assumes keras-tuner is imported as <code>kt</code> and NumPy as <code>np</code>):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> CVTuner(kt.engine.tuner.Tuner):</span>
<span id="cb2-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> run_trial(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, trial, X, y, splits, batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">32</span>, epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,callbacks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>):</span>
<span id="cb2-3">        val_losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb2-4">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> train_indices, test_indices <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> splits:</span>
<span id="cb2-5">            X_train, X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [x[train_indices] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> X], [x[test_indices] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> x <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> X]</span>
<span id="cb2-6">            y_train, y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [a[train_indices] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> y], [a[test_indices] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> y]</span>
<span id="cb2-7">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(X_train) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:</span>
<span id="cb2-8">                X_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-9">                X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_test[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-10">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(y_train) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:</span>
<span id="cb2-11">                y_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y_train[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-12">                y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y_test[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb2-13">            </span>
<span id="cb2-14">            model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.hypermodel.build(trial.hyperparameters)</span>
<span id="cb2-15">            hist <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.fit(X_train,y_train,</span>
<span id="cb2-16">                      validation_data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(X_test,y_test),</span>
<span id="cb2-17">                      epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>epochs,</span>
<span id="cb2-18">                        batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>batch_size,</span>
<span id="cb2-19">                      callbacks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>callbacks)</span>
<span id="cb2-20">            </span>
<span id="cb2-21">            val_losses.append([np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">max</span>(hist.history[k]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> hist.history]) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#change this based on your metric; we use AUC, so we take the maximum value over the epochs</span></span>
<span id="cb2-22">        val_losses <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.asarray(val_losses)</span>
<span id="cb2-23">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.oracle.update_trial(trial.trial_id, {k:np.mean(val_losses[:,i]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i,k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(hist.history.keys())})</span>
<span id="cb2-24">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.save_model(trial.trial_id, model)</span></code></pre></div></div>
<p>Make sure you understand each line, so that you can customize it later for your own use case.</p>
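<p>To make the aggregation at the end of <code>run_trial</code> concrete, here is a small standalone sketch with made-up numbers: take the best epoch per fold for each metric, then average across folds. The class above applies <code>np.max</code> to every logged metric, which suits AUC; for a loss you would normally want the minimum, which is what this sketch does. The <code>summarize</code> helper, its <code>mode</code> argument and the history values are hypothetical and not part of keras-tuner.</p>

```python
from statistics import mean

# Hypothetical per-fold histories, shaped like the History.history dict in Keras.
fold_histories = [
    {"val_loss": [0.70, 0.69, 0.71], "val_auc": [0.52, 0.54, 0.53]},
    {"val_loss": [0.69, 0.68, 0.68], "val_auc": [0.53, 0.55, 0.56]},
]

def summarize(histories, mode):
    """Pick the best epoch per fold for each metric, then average over folds."""
    pick = {"min": min, "max": max}
    return {
        name: mean(pick[mode[name]](history[name]) for history in histories)
        for name in histories[0]
    }

scores = summarize(fold_histories, mode={"val_loss": "min", "val_auc": "max"})
print(scores)
```

<p>The resulting dictionary has the same shape as the one passed to <code>self.oracle.update_trial</code> in the class above.</p>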
<p>As a next step, we can use our “CVTuner” class directly:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">tuner <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CVTuner(</span>
<span id="cb3-2">    hypermodel<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>model_fn, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#keras model definition function</span></span>
<span id="cb3-3">    oracle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>kt.oracles.BayesianOptimization(</span>
<span id="cb3-4">        objective<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> kt.Objective(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_loss'</span>, direction<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'min'</span>),</span>
<span id="cb3-5">        num_initial_points<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb3-6">        max_trials<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>,</span>
<span id="cb3-7">        seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>SEED</span>
<span id="cb3-8">        )</span>
<span id="cb3-9">)</span>
<span id="cb3-10">        </span>
<span id="cb3-11">tuner.search(</span>
<span id="cb3-12">    (X,), </span>
<span id="cb3-13">    (X,y), </span>
<span id="cb3-14">    splits<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>splits, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#here we pass the CV folds</span></span>
<span id="cb3-15">    batch_size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4096</span>,</span>
<span id="cb3-16">    epochs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>,</span>
<span id="cb3-17">    callbacks<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[EarlyStopping(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'val_loss'</span>,patience<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)]</span>
<span id="cb3-18">)</span></code></pre></div></div>
<p>And that is it!</p>
<p>With the help of the keras-tuner package, we can set up hyperparameter tuning in only a few lines of code.</p>
<hr>
</section>
<section id="keras-fast-inference-know-how" class="level3">
<h3 class="anchored" data-anchor-id="keras-fast-inference-know-how">Keras fast inference know-how</h3>
<p>To simulate a real-world High-Frequency-Trading (HFT) prediction scenario, the organizers set a time limit that required roughly 60 predictions/second.</p>
<p>This made heavy feature engineering almost impossible, leading us to focus on methods that improve the prediction/second rate.</p>
<p>One of the interesting methods was to make our Keras NN model’s inference ~3-4 times faster by converting it to TensorFlow Lite and wrapping it in a “LiteModel” class:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> LiteModel:</span>
<span id="cb4-2">    </span>
<span id="cb4-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb4-4">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> from_file(cls, model_path):</span>
<span id="cb4-5">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> LiteModel(tf.lite.Interpreter(model_path<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>model_path))</span>
<span id="cb4-6">    </span>
<span id="cb4-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">@classmethod</span></span>
<span id="cb4-8">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> from_keras_model(cls, kmodel):</span>
<span id="cb4-9">        converter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf.lite.TFLiteConverter.from_keras_model(kmodel)</span>
<span id="cb4-10">        tflite_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> converter.convert()</span>
<span id="cb4-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> LiteModel(tf.lite.Interpreter(model_content<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>tflite_model))</span>
<span id="cb4-12">    </span>
<span id="cb4-13">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, interpreter):</span>
<span id="cb4-14">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> interpreter</span>
<span id="cb4-15">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.allocate_tensors()</span>
<span id="cb4-16">        input_det <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.get_input_details()[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-17">        output_det <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.get_output_details()[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-18">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> input_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"index"</span>]</span>
<span id="cb4-19">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> output_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"index"</span>]</span>
<span id="cb4-20">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> input_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shape"</span>]</span>
<span id="cb4-21">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_shape <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> output_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"shape"</span>]</span>
<span id="cb4-22">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> input_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dtype"</span>]</span>
<span id="cb4-23">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_dtype <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> output_det[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dtype"</span>]</span>
<span id="cb4-24">        </span>
<span id="cb4-25">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, inp):</span>
<span id="cb4-26">        inp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> inp.astype(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_dtype)</span>
<span id="cb4-27">        count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> inp.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-28">        out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.zeros((count, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]), dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_dtype)</span>
<span id="cb4-29">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(count):</span>
<span id="cb4-30">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.set_tensor(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_index, inp[i:i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb4-31">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.invoke()</span>
<span id="cb4-32">            out[i] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.get_tensor(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_index)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-33">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> out</span>
<span id="cb4-34">    </span>
<span id="cb4-35">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict_single(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, inp):</span>
<span id="cb4-36">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">""" Like predict(), but only for a single record. The input data can be a Python list. """</span></span>
<span id="cb4-37">        inp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array([inp], dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_dtype)</span>
<span id="cb4-38">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.set_tensor(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.input_index, inp)</span>
<span id="cb4-39">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.invoke()</span>
<span id="cb4-40">        out <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.interpreter.get_tensor(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.output_index)</span>
<span id="cb4-41">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> out[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div></div>
<p>And finally, transform our model using the “LiteModel” class:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> LiteModel.from_keras_model(model)</span>
<span id="cb5-2">model.predict(X_test)</span></code></pre></div></div>
<p>With this simple step, we were able to drastically increase our predictions/second: approximately 3-4 times!</p>
<p>However, one thing to note here is that since the “TFLite” conversion optimizes the model, the converted model’s predictions might differ slightly from the original’s.</p>
<p>Despite this drawback, I did not see a significant drop in performance in my experiments, so I used the “TFLite” conversion without worrying too much.</p>
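<p>Both claims (the speed-up and the small difference in predictions) are easy to check on your own model. The sketch below times row-by-row prediction for two predict functions and reports the largest absolute difference between their outputs. To keep it self-contained, the two functions are plain NumPy stand-ins (a float64 and a float32 version of the same linear model), not real Keras or TFLite calls; the helper name <code>benchmark</code> is also my own assumption.</p>

```python
import time
import numpy as np

def benchmark(predict_fn, X, repeats=3):
    """Time row-by-row prediction; return (rows per second, predictions)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        preds = np.vstack([predict_fn(X[i:i + 1]) for i in range(len(X))])
        best = min(best, time.perf_counter() - start)
    return len(X) / best, preds

# Stand-ins for the two models (assumptions, not real Keras/TFLite calls).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 1))

def reference_predict(x):
    return x @ W  # float64 "original" model

def converted_predict(x):
    # float32 arithmetic mimics the small numeric drift of a converted model
    return (x.astype(np.float32) @ W.astype(np.float32)).astype(np.float64)

X = rng.normal(size=(200, 16))
ref_rate, ref_preds = benchmark(reference_predict, X)
new_rate, new_preds = benchmark(converted_predict, X)
print(f"reference: {ref_rate:.0f} rows/s, converted: {new_rate:.0f} rows/s")
print("max abs difference:", np.abs(ref_preds - new_preds).max())
```

<p>In practice, you would pass the Keras model’s <code>predict</code> and the <code>LiteModel</code> instance’s <code>predict</code> in place of the two stand-in functions, and use a held-out slice of your own features as <code>X</code>.</p>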
<hr>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion</h3>
<p>To conclude, I have learned many valuable techniques by participating in the “Jane Street Market Prediction” competition.</p>
<p>The winners of the competition will be announced in 6 months. Until then, our models are being tested against real-time market data.</p>
<p>Ultimately, no matter if you win a medal or not, participating in Kaggle competitions is surely the best way to acquire up-to-date State Of The Art (SOTA) ML modeling knowledge!</p>
<p>Therefore, for anyone who is interested in Data Science, Machine Learning or AI in general, I highly encourage you to try to Kaggle.</p>
<p>Thank you for reading and happy Kaggling!</p>


</section>

 ]]></description>
  <category>kaggle</category>
  <category>tips and tricks</category>
  <guid>https://bilguunbatsaikhan.com/posts/takeaways-from-kaggle-jane-street-market-prediction-competition/</guid>
  <pubDate>Fri, 12 Mar 2021 15:00:00 GMT</pubDate>
</item>
<item>
  <title>Not so simple classification.</title>
  <link>https://bilguunbatsaikhan.com/posts/not-so-easy-poc/</link>
  <description><![CDATA[ 




<p>Most resources on the internet treat binary classification as a relatively straightforward problem.</p>
<p>I have had the opportunity to work on a Proof-Of-Concept (POC) project, where we had to predict whether a user would visit the campaign website given a set of features about the user.</p>
<p>My understanding of binary classification changed dramatically after the experiments. Hopefully, your perspective on binary classification will change too.</p>
<hr>
<p>By running a variety of experiments on our data set, we found several practical techniques that might be useful for the reader.<br>
<br>
Specifically, we found answers to 3 critical questions:</p>
<ol type="1">
<li>How to train a model when only 1% of the samples are positive samples?</li>
<li>How to choose the decision threshold of a classification model?</li>
<li>How to evaluate a model, so that the client can easily understand the capabilities of the model?</li>
</ol>
<hr>
<section id="how-to-train-a-model-when-only-1-of-the-samples-are-positive-samples" class="level3">
<h3 class="anchored" data-anchor-id="how-to-train-a-model-when-only-1-of-the-samples-are-positive-samples">1. How to train a model when only 1% of the samples are positive samples?<br>
</h3>
<p>Before diving deep, let’s train a classifier and calculate our baseline scores!</p>
<p>First, let’s create a simple data set:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> make_classification</span>
<span id="cb1-3"></span>
<span id="cb1-4">X, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> make_classification(n_samples<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100000</span>, </span>
<span id="cb1-5">                           n_features<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, </span>
<span id="cb1-6">                           n_redundant<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb1-7">                           n_clusters_per_class<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, </span>
<span id="cb1-8">                           weights<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.98</span>], </span>
<span id="cb1-9">                           flip_y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, </span>
<span id="cb1-10">                           random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>)</span></code></pre></div></div>
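<p>By setting “weights” to [0.98], we assign roughly 98% of the samples to the negative class (0). This leaves about 2% positive samples and gives us an imbalanced data set.</p>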
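<p>A quick way to confirm the imbalance is to count the labels. This is a small sketch that regenerates a data set with the same settings; the smaller sample size and the names “X_small”, “y_small” are chosen here only to keep the check fast:</p>

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as above, but with fewer samples to keep the check fast.
X_small, y_small = make_classification(n_samples=10000,
                                       n_features=30,
                                       n_redundant=0,
                                       n_clusters_per_class=2,
                                       weights=[0.98],
                                       flip_y=0,
                                       random_state=12345)

class_counts = np.bincount(y_small)  # samples per class: [negatives, positives]
positive_rate = class_counts[1] / len(y_small)
print(class_counts, "positive rate: {:.1%}".format(positive_rate))
```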
<p>For ease of use, let’s convert the arrays into a DataFrame:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">cols <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"feature_"</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>(i) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, X.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)]</span>
<span id="cb2-2">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame.from_records(X, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>cols)</span>
<span id="cb2-3">df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> y</span></code></pre></div></div>
<p>Let’s split the data into train/test sets to evaluate our models on a common test set:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">pos_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#extracting positive samples</span></span>
<span id="cb3-2">neg_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#extracting negative samples</span></span>
<span id="cb3-3"></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># let's set aside 500 positive and 500 negative samples</span></span>
<span id="cb3-5">test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>], neg_samples[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).reset_index(drop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#combine, shuffle, reset indices</span></span>
<span id="cb3-6"></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># let's use the rest for training</span></span>
<span id="cb3-8">train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>:], neg_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>:]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).reset_index(drop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#combine, shuffle, reset indices</span></span></code></pre></div></div>
<p>Now let’s train a model using the training set and evaluate on the test set (for simplicity, I am going to use sklearn’s “RandomForestClassifier”):</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> RandomForestClassifier</span>
<span id="cb4-2"></span>
<span id="cb4-3">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#the more the better, but slower</span></span>
<span id="cb4-4">                               random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#lucky number</span></span>
<span id="cb4-5">                               class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>,</span>
<span id="cb4-6">                               verbose<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb4-7">                               n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span></code></pre></div></div>
<p>Make sure the “class_weight” parameter is set to “balanced”, since we are dealing with an imbalanced classification problem!</p>
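<p>To make the effect of “balanced” concrete: scikit-learn documents this mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below applies that formula to a toy label array with a 98:2 class ratio (the array “y_toy” is made up for illustration):</p>

```python
import numpy as np

# Toy labels with a 98:2 class ratio (illustrative only).
y_toy = np.array([0] * 980 + [1] * 20)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))

# Formula documented for class_weight="balanced":
# n_samples / (n_classes * np.bincount(y))
balanced_weights = n_samples / (n_classes * np.bincount(y_toy))
print(balanced_weights)  # class 0 gets ~0.51, class 1 gets 25.0
```

<p>In other words, with this ratio each positive sample counts 49 times as much as a negative sample during training.</p>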
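<p>Since we are working with imbalanced data, let’s use sklearn’s “balanced_accuracy_score”, “classification_report” and “plot_confusion_matrix” functionalities to evaluate our model (note that “plot_confusion_matrix” was removed in scikit-learn 1.2; on newer versions, “ConfusionMatrixDisplay.from_estimator” is the replacement):</p>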
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> classification_report, balanced_accuracy_score, plot_confusion_matrix</span>
<span id="cb5-2"></span>
<span id="cb5-3">preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])</span>
<span id="cb5-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(classification_report(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds))</span>
<span id="cb5-5"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)))</span></code></pre></div></div>
<section id="output" class="level5">
<h5 class="anchored" data-anchor-id="output">output:</h5>
<pre class="out"><code>              precision    recall  f1-score   support

           0       0.74      1.00      0.85       500
           1       1.00      0.64      0.78       500

    accuracy                           0.82      1000
   macro avg       0.87      0.82      0.82      1000
weighted avg       0.87      0.82      0.82      1000

Accuracy: 82%</code></pre>
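<p>As a quick sanity check, balanced accuracy is the average of the per-class recalls, so the 82% can be reproduced from the report above. The recall values below are copied from the report and are therefore rounded to two decimals:</p>

```python
# Per-class recall values as printed in the classification report (rounded).
recall_negative = 1.00  # class 0
recall_positive = 0.64  # class 1

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_negative + recall_positive) / 2
print("Balanced accuracy: {:.0%}".format(balanced_accuracy))
```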
<p>We can predict our target with 82% balanced accuracy, not bad!<br>
Let’s take a look at the confusion matrix:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb7-2">plt.rcParams.update({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'font.size'</span>: <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>})</span>
<span id="cb7-3"></span>
<span id="cb7-4">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb7-5">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plot_confusion_matrix(model, test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], values_format <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'.0f'</span>, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.cm.Blues, ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ax)</span></code></pre></div></div>
</section>
<section id="output-1" class="level5">
<h5 class="anchored" data-anchor-id="output-1">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.29.37-1.png" class="img-fluid"></p>
<p>The baseline model misses many positive samples: with a recall of 0.64 for class 1, about 36% of the positive samples (roughly 180 of 500) are predicted as negative. Let’s try to improve our results!</p>
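<p>The same information can be read off numerically. The sketch below uses scikit-learn’s “confusion_matrix” on a tiny set of made-up labels (“y_true_demo” and “y_pred_demo” are illustrative, not our test set) to show how to extract the number of missed positives:</p>

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 5 negatives followed by 5 positives.
y_true_demo = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# All negatives are predicted correctly, but 2 of the 5 positives are missed.
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
recall_positive = tp / (tp + fn)
print("missed positives:", fn, "| recall for class 1:", recall_positive)
```

<p>With our test set, the equivalent call would be “confusion_matrix(test["Class"], preds)”.</p>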
<hr>
</section>
</section>
<section id="bagging" class="level3">
<h3 class="anchored" data-anchor-id="bagging">Bagging</h3>
<p><strong>What is Bagging?</strong><br>
Bagging (short for bootstrap aggregating) is a technique commonly used in ensemble learning. In the simplified form used here, it means splitting the training data into several subsets and training a model on each subset (classic bagging draws the subsets with replacement instead). After training, each model generates a prediction for a sample, and all predictions are averaged to produce the final prediction.</p>
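<p>To illustrate the averaging step with plain numbers, here is a minimal sketch. The three probability arrays are made up and stand in for the positive-class probabilities returned by three separately trained models; the 0.5 cut-off is likewise just an assumption for the example:</p>

```python
import numpy as np

# Hypothetical positive-class probabilities from three models for four samples.
probs_a = np.array([0.9, 0.2, 0.6, 0.1])
probs_b = np.array([0.7, 0.4, 0.4, 0.2])
probs_c = np.array([0.8, 0.3, 0.8, 0.3])

# Average the models' probabilities sample by sample.
avg_probs = np.mean([probs_a, probs_b, probs_c], axis=0)

# Turn the averaged probabilities into class labels with a 0.5 cut-off.
avg_preds = (avg_probs >= 0.5).astype(int)
print(avg_probs, avg_preds)
```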
<p><strong>How to choose the number of splits?</strong><br>
There is no fixed rule for the number of splits. Depending on how many positive samples are available, I usually choose 5-10 subsets and keep adding splits as long as the results improve.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1">pos_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train[train[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-2">neg_samples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train[train[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># let's split into 5 bags, each pairing ~300 positives with 3,000 negatives</span></span>
<span id="cb8-5">train_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>], neg_samples[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3000</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-6">train_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>], neg_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3000</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6000</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-7">train_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">600</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">900</span>], neg_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6000</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9000</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-8">train_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">900</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1200</span>], neg_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9000</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12000</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-9">train_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((pos_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1200</span>:], neg_samples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12000</span>:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15000</span>]), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div></div>
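<p>The five hand-written slices can also be generated in a loop, which makes it easier to try a different number of bags. This is a sketch under assumptions: “make_bags”, “n_bags” and “neg_per_pos” are hypothetical names, and the 1:10 positive-to-negative ratio simply mirrors the 300:3000 slices above.</p>

```python
import numpy as np
import pandas as pd


def make_bags(pos_samples, neg_samples, n_bags=5, neg_per_pos=10):
    """Split positives into n_bags disjoint chunks and pair each chunk
    with its own disjoint slice of neg_per_pos times as many negatives."""
    bags = []
    start_neg = 0
    # np.array_split tolerates sizes that are not divisible by n_bags.
    for pos_idx in np.array_split(np.arange(len(pos_samples)), n_bags):
        pos_chunk = pos_samples.iloc[pos_idx]
        n_neg = len(pos_chunk) * neg_per_pos
        neg_chunk = neg_samples.iloc[start_neg:start_neg + n_neg]
        start_neg += n_neg
        bags.append(pd.concat((pos_chunk, neg_chunk), axis=0))
    return bags


# Tiny made-up frame: 10 positives and 200 negatives.
demo = pd.DataFrame({"feature_1": np.arange(210.0),
                     "Class": [1] * 10 + [0] * 200})
demo_bags = make_bags(demo[demo["Class"] == 1], demo[demo["Class"] == 0])
print([len(bag) for bag in demo_bags])
```

<p>Assuming 1,500 positive training samples, “make_bags(pos_samples, neg_samples)” would produce the same 5 bags of 300 positives and 3,000 negatives as the slices above.</p>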
<p>Train 5 models for 5 splits:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">bag_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train_1.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train_1[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span>
<span id="cb9-2">bag_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train_2.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train_2[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span>
<span id="cb9-3">bag_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train_3.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train_3[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span>
<span id="cb9-4">bag_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train_4.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train_4[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span>
<span id="cb9-5">bag_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RandomForestClassifier(n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, class_weight<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"balanced"</span>, n_jobs<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).fit(train_5.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], train_5[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span></code></pre></div></div>
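<p>Since the five calls differ only in the training subset, they can be written as a loop. The sketch below uses small random stand-in bags and fewer trees so that it runs quickly; “demo_bags”, “fit_bag” and the data are assumptions for illustration, while the model settings follow the calls above.</p>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def fit_bag(bag, n_estimators=1000):
    """Train one forest on a bag whose last column is the "Class" label."""
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   random_state=12345,
                                   class_weight="balanced",
                                   n_jobs=-1)
    return model.fit(bag.values[:, :-1], bag["Class"])


# Random stand-in bags (illustrative only): 3 features plus the label column.
rng = np.random.default_rng(12345)
demo_bags = []
for _ in range(5):
    bag = pd.DataFrame(rng.normal(size=(110, 3)),
                       columns=["feature_1", "feature_2", "feature_3"])
    bag["Class"] = [1] * 10 + [0] * 100
    demo_bags.append(bag)

# One model per bag; n_estimators is reduced here only to keep the demo fast.
bag_models = [fit_bag(bag, n_estimators=20) for bag in demo_bags]
print(len(bag_models), "models trained")
```

<p>With the real data, the equivalent call would be “[fit_bag(bag) for bag in (train_1, train_2, train_3, train_4, train_5)]”.</p>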
<p><strong>How to combine the predictions of the models?</strong><br>
We can use the “predict_proba” function of our “RandomForestClassifier” model to get the probabilities of each sample belonging to a specific class!</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1">probs_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_1.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-2">probs_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_2.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-3">probs_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_3.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-4">probs_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_4.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb10-5">probs_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_5.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span></code></pre></div></div>
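<p>The five near-identical lines above can also be written as a loop over a list of fitted models. The following is only a minimal sketch, not the code that produced the results in this post: the helper name <code>average_positive_proba</code>, the three small forests, and the synthetic data from <code>make_classification</code> are all assumptions made for illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def average_positive_proba(models, X):
    """Average the class-1 probabilities predicted by several fitted models."""
    # Column 1 of predict_proba belongs to the second entry of classes_ (class 1 here).
    per_model = [m.predict_proba(X)[:, 1] for m in models]
    return np.mean(per_model, axis=0)


# Synthetic stand-in data (hypothetical; not the dataset used in this post).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train, X_test = X[:450], y[:450], X[450:]

# Three small forests, each fitted on a different slice of the training data.
models = [
    RandomForestClassifier(n_estimators=20, random_state=i).fit(
        X_train[i::3], y_train[i::3]
    )
    for i in range(3)
]

probs = average_positive_proba(models, X_test)
print(probs.shape)  # (150,)
```

<p>Keeping the models in a list means that adding or removing a bag does not require touching the averaging code.</p>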
<p>Let’s average the five probability vectors, apply the default threshold of 0.5, and evaluate the combined model:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (probs_1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_3<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_5)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb11-2">preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb11-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(classification_report(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds))</span>
<span id="cb11-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)))</span></code></pre></div></div>
<section id="output-2" class="level5">
<h5 class="anchored" data-anchor-id="output-2">output:</h5>
<pre class="out"><code>              precision    recall  f1-score   support

           0       0.76      1.00      0.87       500
           1       1.00      0.69      0.82       500

    accuracy                           0.85      1000
   macro avg       0.88      0.85      0.84      1000
weighted avg       0.88      0.85      0.84      1000

Accuracy: 84%</code></pre>
<p>That is only a 2-point increase in balanced accuracy over the baseline (82% to 84%)…</p>
<p>Let’s take a look at the confusion matrix:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> confusion_matrix</span>
<span id="cb13-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb13-3"></span>
<span id="cb13-4">cm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> confusion_matrix(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)</span>
<span id="cb13-5">ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplot()</span>
<span id="cb13-6">sns.heatmap(cm, annot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax, fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.cm.Blues)</span>
<span id="cb13-7"></span>
<span id="cb13-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># labels, title and ticks</span></span>
<span id="cb13-9">ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Predicted labels'</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span>ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'True labels'</span>)</span>
<span id="cb13-10">ax.xaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax.yaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])</span>
<span id="cb13-11">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.tight_layout()</span></code></pre></div></div>
</section>
<section id="output-3" class="level5">
<h5 class="anchored" data-anchor-id="output-3">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.33.17-1.png" class="img-fluid"></p>
<p>Some of the predictions have improved… but the number of false negatives is still high!</p>
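<p>To put a number on the false negatives instead of reading them off the heatmap, the confusion matrix can be unpacked into its four cells. Here is a small sketch on made-up labels (not the data from this post); the variable names are hypothetical.</p>

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall of class 1: the share of actual positives that were caught.
recall = tp / (tp + fn)
print(tn, fp, fn, tp)  # 3 0 2 2
print(recall)          # 0.5
```

<p>A high false-negative count shows up directly as a low recall for class 1, which is the number to watch in the classification reports.</p>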
<p><strong>Now, you might be wondering if Bagging is worth it or not.<br>
Here comes the interesting part…</strong></p>
<hr>
</section>
</section>
<section id="how-to-choose-the-decision-threshold-of-a-model" class="level3">
<h3 class="anchored" data-anchor-id="how-to-choose-the-decision-threshold-of-a-model">2. How to choose the decision threshold of a model?</h3>
<p>The advantage of the Bagging method is observed when we choose optimal decision thresholds for our classifiers.</p>
<p><strong>But what is a decision threshold?</strong><br>
It is the cut-off that turns a predicted probability into a class label. For example, in a binary classification problem with class labels 0 and 1 and a decision threshold of 0.5, a sample whose predicted probability of class 1 is below 0.5 is assigned to class 0, and a sample whose probability is greater than or equal to 0.5 is assigned to class 1.</p>
<ul>
<li>Prediction &lt; 0.5 = Class 0</li>
<li>Prediction &gt;= 0.5 = Class 1</li>
</ul>
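<p>In code, the rule above is a single comparison. This is a minimal sketch with made-up probabilities; the helper name <code>apply_threshold</code> is an assumption and is not used elsewhere in this post.</p>

```python
import numpy as np


def apply_threshold(probs, threshold=0.5):
    """Return 1 where the probability is at or above the threshold, else 0."""
    return (np.asarray(probs) >= threshold).astype(int)


probs = [0.05, 0.30, 0.50, 0.90]    # made-up probabilities of class 1
print(apply_threshold(probs))        # [0 0 1 1]
print(apply_threshold(probs, 0.1))   # [0 1 1 1]
```

<p>Lowering the threshold flips borderline samples from class 0 to class 1, which trades false negatives for false positives.</p>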
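<p><strong>Why do we need to optimize the decision thresholds?</strong><br>
The default of 0.5 is only a convention: when the classes are imbalanced, or when false negatives and false positives carry different costs, another cut-off can serve the chosen metric better. This discussion goes into the details: <a href="https://stats.stackexchange.com/questions/312119/reduce-classification-probability-threshold" class="uri">https://stats.stackexchange.com/questions/312119/reduce-classification-probability-threshold</a></p>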
<p>Okay let’s start!</p>
<section id="baseline-model-with-optimized-decision-thresholds" class="level4">
<h4 class="anchored" data-anchor-id="baseline-model-with-optimized-decision-thresholds">Baseline model with optimized decision thresholds:</h4>
<p>One easy way to optimize the decision threshold is to simply try a grid of candidate values (here 0.10 to 0.95 in steps of 0.05) and keep the one with the best balanced accuracy:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1">dts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)]</span>
<span id="cb14-2">accs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb14-3">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the probabilities do not depend on the threshold, so compute them once</span></span>
<span id="cb14-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> dt <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dts:</span>
<span id="cb14-5">    preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> dt <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb14-6">    acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)</span>
<span id="cb14-7">    accs.append(acc)</span></code></pre></div></div>
<p>Let’s plot the accuracies for each decision threshold:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">fig<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb15-2">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.plot(accs)</span>
<span id="cb15-3">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.xticks([i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(dts))], dts)</span>
<span id="cb15-4">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.grid()</span>
<span id="cb15-5">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.tight_layout()</span>
<span id="cb15-6">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Decision thresholds"</span>)</span>
<span id="cb15-7">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracies"</span>)</span></code></pre></div></div>
<section id="output-4" class="level5">
<h5 class="anchored" data-anchor-id="output-4">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.34.18-1.png" class="img-fluid"></p>
<p>We can see that a decision threshold of 0.1, the lowest value in our grid, yields the best accuracy!</p>
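<p>The grid only covers 18 candidate values. As a possible alternative (not what was used for the results in this post), scikit-learn’s <code>roc_curve</code> returns every distinct cut-off present in the scores, and for two classes the balanced accuracy is the mean of the true positive rate and the true negative rate. The sketch below uses made-up labels and scores, and the helper name <code>best_threshold_by_balanced_accuracy</code> is hypothetical.</p>

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_curve


def best_threshold_by_balanced_accuracy(y_true, scores):
    """Return (threshold, balanced accuracy) for the best cut-off in the scores."""
    # roc_curve predicts class 1 when score >= threshold, for every distinct cut-off.
    fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
    # For two classes, balanced accuracy is the mean of TPR and TNR (TNR = 1 - FPR).
    bal_acc = (tpr + (1.0 - fpr)) / 2.0
    best = int(np.argmax(bal_acc))
    return thresholds[best], bal_acc[best]


# Made-up labels and scores, for illustration only.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_scores = np.array([0.02, 0.05, 0.08, 0.30, 0.12, 0.40, 0.70, 0.90])

threshold, score = best_threshold_by_balanced_accuracy(toy_labels, toy_scores)

# Cross-check against scikit-learn's own metric at the chosen threshold.
toy_preds = (toy_scores >= threshold).astype(int)
print(threshold, score, balanced_accuracy_score(toy_labels, toy_preds))
```

<p>Whichever search is used, picking the threshold on a separate validation split rather than on the test set gives a more honest estimate of how the tuned model will perform.</p>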
<p>Let’s evaluate:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb16-2">preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb16-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(classification_report(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds))</span>
<span id="cb16-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)))</span></code></pre></div></div>
</section>
<section id="output-5" class="level5">
<h5 class="anchored" data-anchor-id="output-5">output:</h5>
<pre class="out"><code>              precision    recall  f1-score   support

           0       0.83      0.99      0.91       500
           1       0.99      0.80      0.89       500

    accuracy                           0.90      1000
   macro avg       0.91      0.90      0.90      1000
weighted avg       0.91      0.90      0.90      1000

Accuracy: 89%</code></pre>
<p>The baseline model’s balanced accuracy has improved from 82% to 89%, just by lowering the threshold! Great!<br>
How about the confusion matrix?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1">cm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> confusion_matrix(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)</span>
<span id="cb18-2">ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplot()</span>
<span id="cb18-3">sns.heatmap(cm, annot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax, fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.cm.Blues)</span>
<span id="cb18-4">ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Predicted labels'</span>)</span>
<span id="cb18-5">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'True labels'</span>)</span>
<span id="cb18-6">ax.xaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax.yaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])</span>
<span id="cb18-7">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.tight_layout()</span></code></pre></div></div>
</section>
<section id="output-6" class="level5">
<h5 class="anchored" data-anchor-id="output-6">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.36.02-1.png" class="img-fluid"></p>
<p>Looks like we are getting somewhere. The number of false negatives has decreased!</p>
<p>Let’s try the same optimization for the Bagging method!</p>
</section>
</section>
<section id="bagged-models-with-optimized-decision-thresholds" class="level4">
<h4 class="anchored" data-anchor-id="bagged-models-with-optimized-decision-thresholds">Bagged models with optimized decision thresholds:</h4>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1">dts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)]</span>
<span id="cb19-2">accs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb19-3">probs_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_1.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-4">probs_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_2.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-5">probs_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_3.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-6">probs_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_4.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-7">probs_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_5.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb19-8">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (probs_1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_3<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_5)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># averaged once, outside the loop: it does not depend on the threshold</span></span>
<span id="cb19-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> dt <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> dts:</span>
<span id="cb19-10">    preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> dt <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb19-11">    acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)</span>
<span id="cb19-12">    accs.append(acc)</span></code></pre></div></div>
<p>Let’s plot the results:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1">fig<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb20-2">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.plot(accs)</span>
<span id="cb20-3">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.xticks([i <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(dts))], dts)</span>
<span id="cb20-4">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.grid()</span>
<span id="cb20-5">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.tight_layout()</span>
<span id="cb20-6">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Decision thresholds"</span>)</span>
<span id="cb20-7">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracies"</span>)</span></code></pre></div></div>
<section id="output-7" class="level5">
<h5 class="anchored" data-anchor-id="output-7">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.36.59-1.png" class="img-fluid"></p>
<p>Similar to the baseline model, 0.1 seems to be the optimal decision threshold for maximizing accuracy!</p>
<p>Let’s evaluate:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1">probs_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_1.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb21-2">probs_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_2.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb21-3">probs_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_3.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb21-4">probs_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_4.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb21-5">probs_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_5.predict_proba(test.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb21-6">probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (probs_1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_3<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_5)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb21-7">preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb21-8"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(classification_report(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds))</span>
<span id="cb21-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">%"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(balanced_accuracy_score(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)))</span></code></pre></div></div>
</section>
<section id="output-8" class="level5">
<h5 class="anchored" data-anchor-id="output-8">output:</h5>
<pre class="out"><code>              precision    recall  f1-score   support

           0       0.88      0.96      0.92       500
           1       0.96      0.86      0.91       500

    accuracy                           0.91      1000
   macro avg       0.92      0.91      0.91      1000
weighted avg       0.92      0.91      0.91      1000

Accuracy: 91%</code></pre>
<p>We have reached 91% accuracy using Bagging!</p>
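<p>The evaluation code above repeats the same call for each of the five models. As a minimal sketch (not the code that produced the numbers above), the averaging and thresholding can be wrapped in one helper. The names <code>ensemble_predict</code> and <code>ConstantModel</code> are hypothetical; the stub only stands in for the fitted <code>bag_1</code> to <code>bag_5</code> classifiers so that the snippet runs on its own:</p>

```python
import numpy as np


def ensemble_predict(models, X, threshold=0.1):
    """Average the positive-class probabilities of several models, then threshold."""
    # Each predict_proba call returns shape (n_samples, 2); column 1 is P(class == 1).
    probs = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    preds = (probs >= threshold).astype(int)
    return probs, preds


class ConstantModel:
    """Hypothetical stand-in for a fitted classifier; always predicts the same probability."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        pos = np.full(len(X), self.p)
        return np.column_stack([1 - pos, pos])


X_demo = np.zeros((4, 3))  # dummy features; only the row count matters for the stub
probs, preds = ensemble_predict([ConstantModel(0.05), ConstantModel(0.25)], X_demo)
```

<p>With the real models, the call would look like <code>ensemble_predict([bag_1, bag_2, bag_3, bag_4, bag_5], test.values[:, :-1])</code>, assuming the label is the last column as in the code above.</p>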
<p>How about the confusion matrix?</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1">cm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> confusion_matrix(test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>], preds)</span>
<span id="cb23-2">ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplot()</span>
<span id="cb23-3">sns.heatmap(cm, annot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax, fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'g'</span>, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.cm.Blues)</span>
<span id="cb23-4">ax.set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Predicted labels'</span>)</span>
<span id="cb23-5">ax.set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'True labels'</span>)</span>
<span id="cb23-6">ax.xaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax.yaxis.set_ticklabels([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'0'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'1'</span>])</span>
<span id="cb23-7">_<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>plt.tight_layout()</span></code></pre></div></div>
</section>
<section id="output-9" class="level5">
<h5 class="anchored" data-anchor-id="output-9">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.38.11-1.png" class="img-fluid"></p>
<p>We have managed to improve our model’s false negative predictions quite significantly!</p>
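<p>To put a number on the false negatives, the four cells of a binary confusion matrix can be counted directly. In the classification report above, a recall of 0.86 for class 1 corresponds to a false negative rate of roughly 0.14. The sketch below assumes labels encoded as 0 and 1 and uses made-up toy labels, not the data from this post; <code>binary_counts</code> is a hypothetical helper that returns the same four counts, in the same order, as <code>confusion_matrix(y_true, y_pred).ravel()</code>:</p>

```python
import numpy as np


def binary_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels encoded as 0 and 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = int(np.sum(np.logical_and(y_true == 0, y_pred == 0)))
    fp = int(np.sum(np.logical_and(y_true == 0, y_pred == 1)))
    fn = int(np.sum(np.logical_and(y_true == 1, y_pred == 0)))
    tp = int(np.sum(np.logical_and(y_true == 1, y_pred == 1)))
    return tn, fp, fn, tp


# Toy labels for illustration only (not the test set used in this post).
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]
tn, fp, fn, tp = binary_counts(y_true, y_pred)
false_negative_rate = fn / (fn + tp)  # equals 1 - recall of class 1
```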
<p><strong>Before, it was difficult to judge the benefits of the Bagging method.</strong><br>
<strong>However, once we have performed a simple decision threshold optimization routine, we are able to see the advantage of using the Bagging sampling method!</strong></p>
<p>There is still room for improvement, so please try to push the accuracy further.</p>
<p>If you are interested in optimizing the decision threshold of a multi-class classifier, please let me know! I have a working solution.</p>
<hr>
</section>
</section>
</section>
<section id="how-to-evaluate-a-model-so-that-the-client-can-easily-understand-the-capabilities-of-the-model" class="level3">
<h3 class="anchored" data-anchor-id="how-to-evaluate-a-model-so-that-the-client-can-easily-understand-the-capabilities-of-the-model">3. How to evaluate a model, so that the client can easily understand the capabilities of the model?</h3>
<p>Last, but not least, let’s talk about the interpretability of our results.</p>
<p>Our clients often have no prior knowledge of ML evaluation methods, yet they still want to know whether our models are robust.<br>
How do we solve this problem?</p>
<p><strong>For binary classification problems, here is a solution:</strong><br>
In the production environment, our classifier will most likely see data whose class balance differs from that of our test set. Therefore, we can evaluate the capability of our models by simulating inputs with different class distributions!</p>
<p>Example:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.metrics <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> accuracy_score</span>
<span id="cb24-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb24-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb24-4"></span>
<span id="cb24-5">init<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb24-6">rows <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb24-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">11</span>):</span>
<span id="cb24-8">    neg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test[test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].reset_index(drop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb24-9">    pos <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test[test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].reset_index(drop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb24-10">    neg_sample <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> neg[:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(neg.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>init)]</span>
<span id="cb24-11">    pos_sample <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pos[:<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>((((neg.shape[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>init)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>init)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>init))]</span>
<span id="cb24-12">    combined <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.concat((neg_sample, pos_sample), axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb24-13">    probs_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_1.predict_proba(combined.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb24-14">    probs_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_2.predict_proba(combined.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb24-15">    probs_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_3.predict_proba(combined.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb24-16">    probs_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_4.predict_proba(combined.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb24-17">    probs_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_5.predict_proba(combined.values[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb24-18">    probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (probs_1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_3<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_5)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb24-19">    preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb24-20">    acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(preds, combined[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>])</span>
<span id="cb24-21">    neg_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (Counter(preds)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(preds))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb24-22">    pos_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (Counter(preds)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(preds))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb24-23">    rows.append([np.ceil(init<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>), np.ceil((<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>init)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>), neg_preds, pos_preds, acc])</span>
<span id="cb24-24">    init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> init <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span></span>
<span id="cb24-25">    </span>
<span id="cb24-26">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame.from_records(rows, columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Actual - distribution"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Actual + distribution"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Predicted - distribution"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Predicted + distribution"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy"</span>])</span></code></pre></div></div>
<p>Let’s take a look at the results:</p>
<section id="output-10" class="level5">
<h5 class="anchored" data-anchor-id="output-10">output:</h5>
<p><img src="https://bilguunbatsaikhan.com/images/2021/02/Screen-Shot-2021-02-17-at-19.40.40.png" class="img-fluid"></p>
<p>If we take a look at the first 2 columns in the table, we can see that the model is being evaluated on different input distributions (values are in percentages %).</p>
<p>Moreover, our model is able to accurately predict the positive and negative distributions (3rd and 4th columns, values are in percentages %)!</p>
<p>If we take a look at the “Accuracy” column and analyze from top to bottom, we can see that our model has a bias toward predicting negative samples. However, the overall accuracy seems to be quite high!</p>
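<p>One caveat about the loop above: it lowers <code>init</code> by 0.1 on every pass, and repeated floating-point subtraction can drift slightly away from the intended values, which <code>int()</code> and <code>np.ceil</code> may then round the wrong way. A minimal sketch of the same class-mix schedule using integer arithmetic is shown below. The helper name <code>class_mix_counts</code> is hypothetical, and it follows the same sizing rule as the loop above, where both sample sizes are fractions of the negative pool:</p>

```python
def class_mix_counts(n_neg, n_pos, steps=10):
    """List (neg_pct, pos_pct, neg_count, pos_count) from all-negative to all-positive."""
    mixes = []
    for i in range(steps + 1):
        neg_pct = 100 * (steps - i) // steps  # 100, 90, ..., 0 when steps=10
        pos_pct = 100 - neg_pct
        neg_count = n_neg * (steps - i) // steps
        # Cap at the number of positives actually available.
        pos_count = min(n_pos, n_neg * i // steps)
        mixes.append((neg_pct, pos_pct, neg_count, pos_count))
    return mixes


mixes = class_mix_counts(n_neg=500, n_pos=500)
```

<p>Each pair of counts can then be used as <code>neg[:neg_count]</code> and <code>pos[:pos_count]</code> before concatenating the two samples.</p>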
<p><strong>Isn’t the above table too complex?</strong><br>
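We can split the test data into 10 folds (no retraining involved), score each fold, and report the mean accuracy with an approximate 95% interval (the mean plus or minus two standard deviations across folds)!</p>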
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb25" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb25-1">X_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test[cols]</span>
<span id="cb25-2">y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> test[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Class"</span>]</span>
<span id="cb25-3"></span>
<span id="cb25-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KFold</span>
<span id="cb25-5"></span>
<span id="cb25-6">cv <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KFold(n_splits<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12345</span>, shuffle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb25-7">accs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb25-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> train_index, test_index <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> cv.split(X_test, y_test):</span>
<span id="cb25-9">    xtest, ytest <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> X_test.iloc[test_index], y_test.iloc[test_index]</span>
<span id="cb25-10">    probs_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_1.predict_proba(xtest)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb25-11">    probs_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_2.predict_proba(xtest)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb25-12">    probs_3 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_3.predict_proba(xtest)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb25-13">    probs_4 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_4.predict_proba(xtest)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb25-14">    probs_5 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> bag_5.predict_proba(xtest)[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb25-15">    probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (probs_1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_3<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_4<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span>probs_5)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb25-16">    preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> prob <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.15</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> prob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> probs]</span>
<span id="cb25-17">    acc <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> balanced_accuracy_score(ytest, preds)</span>
<span id="cb25-18">    accs.append(acc)</span>
<span id="cb25-19">    </span>
<span id="cb25-20">accs_ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(accs) </span>
<span id="cb25-21"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%0.2f</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;"> (+/- </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%0.2f</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">)"</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%</span> (accs_.mean(), accs_.std() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>))   </span></code></pre></div></div>
<pre class="out"><code>Accuracy: 0.91 (+/- 0.05)</code></pre>
<p>Using this single value for accuracy, the client can easily understand the overall capability of our model!</p>
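<p>The value after <code>+/-</code> in the output above is two standard deviations of the ten fold scores, which is a rough 95% range for the accuracy of a single fold. As a sketch, the hypothetical helper below reports that same range together with the standard error of the mean, which describes how precisely the average itself is estimated. The fold scores here are made up for illustration:</p>

```python
import numpy as np


def summarize_scores(scores):
    """Return (mean, two-standard-deviation spread, standard error of the mean)."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    spread = 2 * scores.std()  # the quantity printed after "+/-" above
    sem = scores.std(ddof=1) / np.sqrt(len(scores))  # sample std / sqrt(n)
    return mean, spread, sem


fold_scores = [0.90, 0.92, 0.88, 0.94, 0.91]  # illustrative values only
mean, spread, sem = summarize_scores(fold_scores)
```

<p>Reporting the spread tells the client how much the accuracy may vary from one batch of data to the next, while the standard error tells them how reliable the headline average is.</p>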
<hr>
</section>
</section>
<section id="conclusion" class="level3">
<h3 class="anchored" data-anchor-id="conclusion">Conclusion:</h3>
<ol type="1">
<li>In this post, I have tried to introduce an ensemble sampling method called “Bagging”</li>
<li>In the beginning, it was difficult to assess the advantage of using the Bagging method. However, once we optimized the decision threshold, we were able to see the benefits</li>
<li>It is common for our clients to require interpretable evaluation metrics. Therefore, I have shared a simple evaluation method for explaining classification models</li>
</ol>


</section>

 ]]></description>
  <category>classification</category>
  <guid>https://bilguunbatsaikhan.com/posts/not-so-easy-poc/</guid>
  <pubDate>Sat, 13 Feb 2021 15:00:00 GMT</pubDate>
</item>
</channel>
</rss>
