Episode 91 — Weighted Least Squares: Handling Non-Constant Variance in Regression

In Episode ninety one, titled “Weighted Least Squares: Handling Non Constant Variance in Regression,” we focus on a practical fix for a very common regression problem: some observations are simply noisier than others, and treating every data point as equally reliable can pull your fitted line in the wrong direction. When variance differs across cases, ordinary least squares can still produce a fit, but that fit is shaped disproportionately by the loudest, most variable parts of the data. Weighted least squares is a way to acknowledge that reality directly, using weights to emphasize observations that are more reliable and to de-emphasize observations that carry more noise. This is not about ignoring inconvenient data; the goal is still to learn from the full sample, but to learn in proportion to how trustworthy each observation is. When applied with discipline, weighting can stabilize coefficients, reduce systematic distortion, and improve real world predictive accuracy under heteroskedastic conditions.

Before we continue, a quick note: this audio course is a companion to the Data X books. The first book is about the exam and provides detailed guidance on how best to pass it. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

Weighted least squares, abbreviated as W L S, is a regression fitting approach that assigns different importance to observations during the estimation process. In ordinary least squares, every observation contributes equally to the loss function, and the model chooses coefficients that minimize the sum of squared residuals across all points. In weighted least squares, the model instead minimizes a weighted sum of squared residuals, where each residual is multiplied by a weight that reflects the relative reliability or variance of that observation. Observations with higher weights influence the fitted coefficients more strongly, while observations with lower weights influence them less. The result is a fitted line that is pulled toward the more trusted points and less distorted by points that are expected to be noisy. This is especially helpful when you have reason to believe that error variance is not constant across the range of the target or across different operating conditions.
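
To make the objective concrete, here is a minimal numerical sketch in Python using a small simulated dataset in which the noise level grows with the predictor; the data, the noise scale, and the weights are all illustrative assumptions rather than a recipe.

```python
import numpy as np

# Minimal sketch of the weighted objective on hypothetical data where the
# noise level grows with x. The data, the 0.3 * x noise scale, and the
# weights are illustrative assumptions, not a prescription.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)      # error spread grows with x

X = np.column_stack([np.ones(n), x])               # design matrix with intercept
w = 1.0 / (0.3 * x) ** 2                           # weights: noisier points count less

# OLS minimizes sum_i (y_i - X_i @ beta)^2
# WLS minimizes sum_i w_i * (y_i - X_i @ beta)^2
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

print("OLS coefficients:", beta_ols)
print("WLS coefficients:", beta_wls)
```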

A standard and principled weighting choice is to set weights inversely proportional to the variance of the errors, because that aligns the fitting objective with the idea of treating high variance observations as less informative. If an observation has a large error variance, it is expected to fluctuate widely even when the model is correct, so forcing the line to chase those fluctuations is not efficient. Inverse variance weighting gives more influence to observations with smaller error variance, where deviations are more likely to reflect model misspecification rather than noise. This aligns with the intuition that a precise measurement should count more than a noisy measurement when you are estimating a relationship. In practical terms, you do not need perfect variance estimates to benefit from this logic, but you do need a defensible rationale for how weights relate to expected noise. The exam expects you to understand the direction of the relationship, meaning higher noise should mean lower weight.
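
As a sketch of what inverse variance weighting looks like in code, the following uses the statsmodels WLS interface, where the weights argument is proportional to one over the assumed error variance; the noise model here is a hypothetical assumption chosen only to show the direction of the relationship, meaning higher noise leads to a lower weight.

```python
import numpy as np
import statsmodels.api as sm

# Sketch of inverse-variance weighting with statsmodels. The noise model
# (standard deviation proportional to x) is a hypothetical assumption.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=300)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)
est_sd = 0.3 * x                                          # assumed per-observation noise level
wls_fit = sm.WLS(y, X, weights=1.0 / est_sd ** 2).fit()   # higher noise -> lower weight
ols_fit = sm.OLS(y, X).fit()

print("OLS:", ols_fit.params)
print("WLS:", wls_fit.params)
```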

Recognizing variance patterns is the first step toward choosing weights, and a common pattern is that higher target values come with higher variance. In business and operational datasets, variability often grows with scale, so large customers have more variable demand, high traffic periods have more variable latency, and high revenue days have more volatility. When you plot residuals against fitted values and see a funnel shape, where residual spread increases as predictions increase, you are observing heteroskedasticity that may be addressable by weighting. Another pattern is group based variance, where some segments or devices produce noisier measurements than others due to sensor quality, sampling differences, or process variation. Recognizing the pattern matters because weights should reflect that pattern rather than being assigned randomly. A good weighting strategy emerges from observing how noise changes across conditions, not from guessing.
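
One rough way to check for the funnel pattern numerically is sketched below: bin observations by fitted value and compare residual spread across bins. The helper assumes you already have fitted values and residuals from an initial unweighted fit, and the function name is only illustrative.

```python
import numpy as np

# Rough diagnostic sketch: bin observations by fitted value and compare the
# residual spread across bins. Spread that grows from bin to bin is the
# funnel pattern described above. 'fitted' and 'resid' are assumed to come
# from an initial unweighted fit.
def residual_spread_by_bin(fitted, resid, n_bins=5):
    edges = np.quantile(fitted, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.digitize(fitted, edges[1:-1]), 0, n_bins - 1)
    return {i: float(resid[bin_idx == i].std(ddof=1)) for i in range(n_bins)}
```

If the spread in the highest bins is several times the spread in the lowest bins, that is the kind of stable, observable pattern that can justify a weighting rule.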

It is also useful to practice deciding when weighted least squares is a better fit than transforming the target, because both approaches can address non constant variance but they do so in different ways. Transforming the target, such as using a logarithm, can stabilize variance by changing the scale on which errors are measured, which can be appropriate when the process is multiplicative or when relative error matters more than absolute error. Weighted least squares, in contrast, keeps the target on its original scale and changes how errors are aggregated, which can be preferable when stakeholders need predictions in the original units and when absolute error costs are asymmetric across ranges. Sometimes transformation and weighting can both work, but choosing between them depends on what you are trying to preserve, such as interpretability in original units or linearity after transformation. The practical point is that weighting is not merely a mathematical trick; it is a modeling choice that embeds a belief about which observations are more reliable. If your domain logic supports that belief, weighted least squares can be the cleaner approach.
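
As a small illustration of the contrast, the sketch below fits both options on hypothetical data with multiplicative noise; the log fit returns predictions in log units while the weighted fit stays in the original units, and the variance model used for the weights is an assumption made only for the example.

```python
import numpy as np
import statsmodels.api as sm

# Sketch contrasting the two options on hypothetical data with multiplicative
# noise: a log transform changes the scale errors are measured on, while WLS
# keeps the original units and changes how errors are aggregated.
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=400)
y = (2.0 + 0.5 * x) * np.exp(rng.normal(scale=0.15, size=400))   # relative (multiplicative) noise

X = sm.add_constant(x)
log_fit = sm.OLS(np.log(y), X).fit()        # predictions come back in log units

# Assumes the noise standard deviation is proportional to the mean level,
# so weights are taken as one over the squared mean (illustrative only).
wls_fit = sm.WLS(y, X, weights=1.0 / (2.0 + 0.5 * x) ** 2).fit()  # stays in original units
```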

A major discipline requirement is avoiding arbitrary weights, because arbitrary weighting can introduce bias just as surely as ignoring heteroskedasticity. Weights should be supported by evidence from the process that generated the data or from clear residual diagnostics that indicate a stable variance pattern. If you weight without evidence, you can end up privileging a subset of observations in a way that reflects convenience or preference rather than reliability. This can distort the fitted relationship, especially if the weighted subset is not representative of the decision environment. In regulated or high stakes settings, arbitrary weighting is also difficult to defend, because you cannot explain why certain cases were effectively muted. The best posture is to treat weights as part of the model specification that must be justified, monitored, and documented. When you cannot justify weights, you should pause and reconsider whether a different model family is more appropriate.

Another subtle but important discipline is fitting weights using training data only, because weights can be learned from data and therefore can create evaluation leakage if they incorporate information from validation or test sets. If you estimate variance patterns using the full dataset, including holdout outcomes, you are letting evaluation data influence the training objective. Even if labels are not used directly in the weighting rule, residual based weight estimation uses the target, and that is a form of learning that must remain confined to training splits. The safe approach is to define the weighting logic based on process knowledge or to estimate it within the training set, then apply the resulting weights when fitting the model for that training split. When using cross validation, this means weight estimation is repeated within each fold, because each fold has a different training subset. This preserves the integrity of evaluation and ensures that any performance improvement is real rather than a consequence of peeking.
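
A minimal sketch of keeping weight estimation inside each training split might look like the following, assuming a simple two-stage rule: fit an unweighted model on the training fold, model how the residual spread changes with the fitted value, then refit with inverse variance weights. The function name and the specific noise model are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Sketch of per-fold weight estimation. X is assumed to be a 2-D numpy array
# (n_samples, n_features) and y a 1-D numpy array for the full modeling data.
def cv_weighted_fit(X, y, n_splits=5):
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Stage 1: unweighted fit on the training fold only
        base = LinearRegression().fit(X_tr, y_tr)
        fitted = base.predict(X_tr)
        resid = y_tr - fitted

        # Stage 2: model the noise level as a function of the fitted value,
        # again using the training fold only
        noise_model = LinearRegression().fit(fitted.reshape(-1, 1), np.abs(resid))
        est_sd = np.clip(noise_model.predict(fitted.reshape(-1, 1)), 1e-6, None)

        # Stage 3: weighted refit; sklearn accepts per-sample weights
        weighted = LinearRegression().fit(X_tr, y_tr, sample_weight=1.0 / est_sd ** 2)

        val_pred = weighted.predict(X_val)
        scores.append(np.mean((y_val - val_pred) ** 2))
    return float(np.mean(scores))
```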

Interpreting weighted least squares results requires remembering what the fit is emphasizing, because the coefficients reflect a model that prioritizes reliability over uniform treatment of all observations. You can think of the weighted fit as being pulled toward the regions or groups where the data is more precise, which often produces a line that better reflects typical behavior rather than extreme volatility. This does not mean you have discarded noisy observations, because they still contribute, but their influence is reduced to match their expected informativeness. The interpretation is also important when stakeholders ask why the model seems to favor certain ranges, because the answer is that the model is matching the reliability structure of the data. If the goal is to minimize expected squared error under heteroskedastic noise, this emphasis is appropriate. The key is to align that emphasis with the real decision context, so the model’s priorities match operational priorities.

Comparing weighted least squares to robust standard errors clarifies an important conceptual distinction: weighting changes the fit, while robust errors change the inference. Robust standard errors adjust uncertainty estimates when variance assumptions are violated, but they do not change the coefficient estimates, meaning the fitted line stays the same. Weighted least squares changes the coefficient estimates because it changes the objective function, which means it can improve predictive performance when the original fit was distorted by heteroskedasticity. This distinction matters because teams sometimes apply robust errors and assume they have “fixed” heteroskedasticity, when in reality they have only adjusted confidence intervals. If your goal is better predictions and a better fitting line in the presence of non constant variance, weighting may be the more direct tool. If your goal is valid inference about coefficients while keeping the same fit, robust errors may be sufficient, depending on the context. Understanding which tool affects what prevents you from solving the wrong problem.
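
The sketch below illustrates the distinction on hypothetical heteroskedastic data: fitting with robust, heteroskedasticity consistent standard errors leaves the coefficients identical to ordinary least squares and only changes the reported uncertainty, while weighted least squares produces a different fitted line.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical heteroskedastic data: noise grows with x.
rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=500)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)
X = sm.add_constant(x)

ols_fit = sm.OLS(y, X).fit()
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")               # same line, adjusted inference
wls_fit = sm.WLS(y, X, weights=1.0 / (0.3 * x) ** 2).fit()  # a different fitted line

print(np.allclose(ols_fit.params, robust_fit.params))  # True: coefficients unchanged
print(ols_fit.bse, robust_fit.bse)                      # standard errors differ
print(wls_fit.params)                                   # coefficients differ from OLS
```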

Any claimed improvement from weighting must be validated on holdout data, not celebrated based on training residuals, because weighting can make training fit look cleaner without improving generalization. It is easy to choose weights that make the residual plot look prettier on the data used to fit the model, especially if weights were tuned aggressively. The test of usefulness is whether predictions improve under the same evaluation procedure you would use for any model selection decision. This means comparing weighted and unweighted fits using a holdout set or cross validation and looking at metrics aligned to business tolerance, not only at residual shape. It also means checking whether improvements are concentrated in the regions that matter operationally, such as high value cases or typical operating ranges. If weighting reduces error in one region but increases it in another that is more important, the tradeoff may be unacceptable. Validating on holdout data keeps the decision grounded in expected performance rather than in aesthetic diagnostics.
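
As one way to structure that comparison, the sketch below evaluates an unweighted and a weighted fit on the same holdout split and reports error overall and by range; the weight rule is passed in as a function fit on training data only, and the names used here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Sketch of a holdout comparison. X and y are assumed to be numpy arrays,
# and est_sd_fn is a hypothetical callable that returns a per-training-
# observation noise estimate derived from the training split only.
def compare_on_holdout(X, y, est_sd_fn, test_size=0.25, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)

    unweighted = LinearRegression().fit(X_tr, y_tr)
    est_sd = est_sd_fn(X_tr, y_tr)                    # weight rule fit on training data only
    weighted = LinearRegression().fit(X_tr, y_tr, sample_weight=1.0 / est_sd ** 2)

    # Overall holdout error, plus a split by range so a gain in one region
    # is not hiding a loss in a region that matters more operationally.
    results = {}
    high = y_te > np.median(y_te)
    for name, model in [("unweighted", unweighted), ("weighted", weighted)]:
        pred = model.predict(X_te)
        results[name] = {
            "mae_all": mean_absolute_error(y_te, pred),
            "mae_low_range": mean_absolute_error(y_te[~high], pred[~high]),
            "mae_high_range": mean_absolute_error(y_te[high], pred[high]),
        }
    return results
```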

Communicating why some observations deserve less influence is a necessary governance step, because weighting can be misinterpreted as intentionally discounting certain users, segments, or situations. The correct framing is that the model is accounting for measurement noise or process variability, not making a value judgment about which cases matter. For example, if high volume periods have inherently higher variance due to fluctuating load, weighting can prevent the fit from being dominated by that volatility while still learning the underlying trend. If certain sensors are known to be noisier, weighting can reflect sensor reliability rather than penalizing a subgroup. Stakeholders often accept weighting when the rationale is tied to measurement quality and when the impact on decisions is transparent. Clear communication should also include how weighting affects error distribution and how it aligns with operational objectives, because that makes the choice defensible.

Documentation of the weighting rule is essential so that training behavior can be reproduced and so that inference and governance remain consistent over time. A weight rule can be as simple as a function of predicted magnitude or as structured as group based reliability factors, but whatever it is, it must be recorded alongside the model specification. Documentation should include how weights were derived, whether they were estimated from training residuals or defined from process knowledge, and how they are applied in retraining. This prevents subtle drift in the weighting logic that could change model behavior without anyone noticing, and it supports audits when someone asks why a certain case had less influence during training. Reproducibility also matters because weighting interacts with evaluation, and a model trained with different weights is not the same model. Treating weights as first class model parameters keeps the lifecycle disciplined.
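
One lightweight way to keep the weight rule reproducible is to record it as a small structured object stored alongside the model specification; the fields and names below are illustrative, not a standard schema.

```python
from dataclasses import dataclass, asdict

# Sketch of recording the weight rule with the model artifact so a retrain
# can reproduce it. Field names and values are illustrative only.
@dataclass
class WeightRuleRecord:
    description: str    # plain-language statement of the rule
    derived_from: str   # e.g. "training residuals" or "process knowledge"
    parameters: dict    # numbers needed to recompute the weights
    applies_to: str     # model or pipeline version the rule belongs to

record = WeightRuleRecord(
    description="weights inversely proportional to residual variance by fitted-value bin",
    derived_from="training residuals, estimated within each training split",
    parameters={"n_bins": 5, "min_weight": 1e-6},
    applies_to="delivery_time_model_v3",   # hypothetical model identifier
)
print(asdict(record))   # store this with the model specification and retraining config
```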

The anchor memory for Episode ninety one is that when noise varies, weights stabilize the fit. Noise variation is what creates heteroskedasticity, and heteroskedasticity is what makes equal weighting inefficient and sometimes misleading. Weighting stabilizes the fit by aligning influence with reliability, reducing the chance that volatile observations dominate the estimated relationship. This anchor also reminds you that the purpose is not to make numbers look good, but to better reflect the data generating process. When you can state this clearly, you demonstrate understanding beyond formulas. It is the difference between applying a technique and applying it for the right reason.

To conclude Episode ninety one, titled “Weighted Least Squares: Handling Non Constant Variance in Regression,” choose a case where weighted least squares is appropriate and state your weight logic in plain language. Consider modeling delivery time where low volume routes are stable but high volume routes have highly variable delays due to congestion and batching effects, producing a clear funnel pattern in residuals as predicted time increases. Weighted least squares is appropriate because the high variance observations should not dominate the fit when the goal is a stable estimate of the underlying relationship between route features and delivery time. A defensible weight logic is to assign weights inversely proportional to the estimated variance by range, meaning observations in ranges with larger residual variance receive smaller weights and observations in more stable ranges receive larger weights. You would estimate that variance pattern using training data only, then apply the weights consistently during fitting and validate improvement on holdout data using metrics aligned to business tolerance. Stating the case and the weight logic this way shows you understand weighted least squares as an evidence based response to non constant variance rather than as an arbitrary tuning trick.
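
As a sketch of that weight logic in code, assuming hypothetical training arrays for route features and delivery times, the helper below estimates residual variance within bins of predicted delivery time on the training data and assigns inverse variance weights by bin.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of the delivery-time weight logic stated above, with hypothetical
# training arrays: fit an unweighted model, estimate residual variance within
# bins of predicted delivery time, then weight each training observation by
# the inverse of its bin's variance before refitting.
def binned_inverse_variance_weights(X_train, y_train, n_bins=5):
    base = LinearRegression().fit(X_train, y_train)
    fitted = base.predict(X_train)
    resid = y_train - fitted

    edges = np.quantile(fitted, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.digitize(fitted, edges[1:-1]), 0, n_bins - 1)

    bin_var = np.array([resid[bin_idx == i].var(ddof=1) for i in range(n_bins)])
    weights = 1.0 / np.maximum(bin_var[bin_idx], 1e-8)   # noisier range -> smaller weight
    return weights

# Usage sketch:
#   weights = binned_inverse_variance_weights(X_train, y_train)
#   model = LinearRegression().fit(X_train, y_train, sample_weight=weights)
```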
