Reading notes: Machine Learning: the high-interest credit card of technical debt

This is my reading note for the paper titled “Machine Learning: The High-Interest Credit Card of Technical Debt”.

Context

Machine learning is a powerful toolkit for building complex systems quickly, but these quick wins do not come for free.

Dilemma: speed of execution versus quality of engineering.

Technical debt tends to compound. Deferring its payment results in:

  • increasing costs
  • system brittleness
  • reduced rates of innovation

Traditional methods to pay off tech debt:

  • refactoring
  • increasing unit test coverage
  • deleting dead code
  • reducing dependencies
  • tightening APIs
  • improving documentation

At the system level, an ML model may subtly erode abstraction boundaries.

Unfortunately, it is hard to enforce strict abstraction boundaries for ML systems with the usual prescriptions:

  • strong abstraction boundaries using encapsulation and modular design
  • maintainable code that makes isolated changes and improvements easy

Arguably the most important reason to use an ML system is precisely that the desired behavior cannot be effectively implemented in software logic without a dependency on external data.

Three ways complex models erode boundaries

  • Entanglement
    • ML models are machines for creating entanglement
    • CACE principle: changing anything changes everything
      • E.g.: changing the input distribution of one feature changes the importance, weights, or use of the remaining features
      • The same holds for hyperparameters: changes in regularization strength, learning settings, sampling methods in training, convergence thresholds, etc., will affect many things
    • Mitigation strategy 1: isolate models and serve ensembles
      • Use cases:
        • sub-problems decompose naturally
        • cost of maintaining separate models is outweighed by benefits of enforced modularity
      • Cons:
        • unscalable in large-scale settings
        • entanglement may still be present within a given model
    • Mitigation strategy 2: gain deep insights into behavior of model predictions
      • use high-dimensional visualization tools to see effects across many dimensions and slicings
      • use metrics that operate on a slice-by-slice basis (see the sketch after this list)
    • Mitigation strategy 3: use more sophisticated regularization methods to enforce that any change in prediction performance carries a cost in the objective function used in training
      • can be useful but is far from a guarantee
      • may add more debt via increased system complexity than is reduced via decreased entanglement
    • Key takeaways:
      • This issue is innate to ML, regardless of the learning algorithm used
      • Shipping the first version of a ML system is easy, but making subsequent improvements is unexpectedly difficult
      • This issue should be weighed carefully against deadline pressures for version 1.0 of any ML system
  • Hidden feedback loops
    • Systems that learn from world behavior are clearly intended to be part of a feedback loop
    • Example: a system to predict the click-through rate (CTR) of news headlines
      • rely on user clicks as training labels
      • user clicks in turn depend on previous predictions from the model
      • leads to issues in analyzing system performance
      • the system will slowly change its behavior
      • gradual changes not visible in quick experiments make it extremely hard to analyze the effect of proposed changes, and add cost to even simple improvements
    • Need to do:
      • look carefully for hidden feedback loops
      • remove those loops whenever feasible
  • Undeclared consumers
    • Without access controls, some consumers may become undeclared consumers, silently using the output of a given prediction model as an input to another component of the system
    • Undeclared consumers are expensive at best and dangerous at worst
    • The expense comes from the sudden tight coupling of model A to other parts of the stack
      • changes to A will likely:
        • impact other parts, sometimes in unintended, poorly understood, or detrimental ways
        • make any change to A at all hard and expensive
    • Danger:
      • may introduce additional hidden feedback loops
    • Mitigations:
      • design system to guard against undeclared consumers
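
Mitigation strategy 2 above (slice-by-slice metrics) lends itself to simple tooling. Below is a minimal sketch in Python of what per-slice metric tracking might look like; the column names, the choice of AUC, and the pandas/scikit-learn dependencies are my own assumptions for illustration, not something prescribed by the paper.

    # Hypothetical sketch: per-slice metrics to surface entanglement (CACE) effects.
    # Column names ("country", "label", "score") are made up for illustration.
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def auc_by_slice(predictions: pd.DataFrame, slice_col: str) -> pd.Series:
        """Compute AUC separately for each value of slice_col.

        An aggregate metric can hide a regression in one slice; comparing
        per-slice AUC before and after a change makes such side effects visible.
        """
        return predictions.groupby(slice_col).apply(
            lambda g: roc_auc_score(g["label"], g["score"])
        )

    # Usage: diff per-slice metrics between the current and a candidate model.
    # old = auc_by_slice(preds_current, "country")
    # new = auc_by_slice(preds_candidate, "country")
    # print((new - old).sort_values())  # regressed slices float to the top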

Data dependencies cost more than code dependencies

  • Code dependencies can be easy to identify via static analysis, linkage graphs, etc
  • Data dependencies have no similar analysis tools
  • Forms of data dependency debt:
  • Unstable data dependencies
    • “Unstable” means input signals qualitatively change behavior over time
    • Can happen implicitly:
      • input signals come from another ML model itself that updates over time
      • data-dependent lookup table
    • Can happen explicitly:
      • when engineering ownership of an input signal is separate from engineering ownership of the model that consumes it
        • According to the CACE principle above, “improvements” to input signals may have arbitrary, sometimes deleterious, effects that are costly to diagnose and address
    • Mitigation:
      • create a versioned copy of a given signal
        • But versioning carries its own cost, such as potential staleness
        • maintaining multiple versions of the same signal over time is a contributor to tech debt in its own right
  • Underutilized data dependencies
    • include features or signals that provide little incremental value in terms of accuracy
    • They creep into an ML model in several ways:
      • Legacy features
        • e.g.: a feature F is included in a model early on. Over time, other features are added that make F mostly or entirely redundant, but this is not detected
      • Bundled features
        • A group of features is added to the model together due to deadline pressures or similar effects. This can hide features that add little or no value
      • ε-Features
        • it can be tempting to add a new feature to a model that improves accuracy, even when the accuracy gain is very small, or when the complexity overhead might be high
    • Mitigation
      • regularly evaluate the effect of removing individual features from a given model and act on this information whenever possible
      • build cultural awareness of the long-term benefit of cleaning up underutilized dependencies
  • Static analysis of data dependencies
    • On teams with many engineers, or when multiple teams interact, not everyone knows the status of every feature, and it can be difficult for any individual to know every last place where a feature is used. In a large company, it may not be easy even to find all the consumers of a given feature
    • Mitigation
      • A great example: “Ad Click Prediction: a View from the Trenches”, KDD 2013
      • annotate data sources and features
      • run automated checks to ensure all dependencies have appropriate annotations (a toy sketch follows this list)
      • fully resolve dependency trees
  • Correction cascades
    • Context: model a for problem A exists. But a solution for a slightly different problem A’ is required
      • It is tempting to learn a model a'(a) that takes a as input and learns a small correction
      • appears to be a fast, low-cost win; it is easy and quick to create a first version
      • However, this correction model has created a system dependency on a, making it significantly more expensive to analyze improvements to that model in the future
      • Things get worse if correction models are cascaded, with a model for problem A’’ learned on top of a’, and so on.
      • Such a system may create an improvement deadlock
        • the coupled ML system is in a poor local optimum, and no component model may be individually improved
        • the independent development that was initially attractive now becomes a large barrier to progress
    • Mitigation:
      • Augment model a to learn the corrections directly within the same model by adding features that help the model distinguish among the various use-cases
      • At test time, the model may be queried with the appropriate features for the appropriate test distributions.
      • This is not a free solution: the solutions for the various related problems remain coupled via CACE, but it may be easier to make updates and evaluate their impact.
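
The “annotate and automatically check” idea from the static analysis item above could look roughly like the following toy sketch in Python. The registry, status values, and feature names are hypothetical and far simpler than the production tooling described in the KDD 2013 paper.

    # Hypothetical sketch: annotate features and fail fast on bad data dependencies.
    from dataclasses import dataclass
    from enum import Enum

    class Status(Enum):
        PRODUCTION = "production"
        EXPERIMENTAL = "experimental"
        DEPRECATED = "deprecated"

    @dataclass(frozen=True)
    class FeatureAnnotation:
        name: str
        owner: str
        status: Status

    # Made-up registry; in practice this would be generated from real metadata.
    FEATURE_REGISTRY = {
        "user_age_bucket": FeatureAnnotation("user_age_bucket", "growth-team", Status.PRODUCTION),
        "legacy_geo_score": FeatureAnnotation("legacy_geo_score", "infra-team", Status.DEPRECATED),
    }

    def check_model_features(model_name, feature_names):
        """Automated check: every consumed feature must be annotated and not deprecated."""
        for name in feature_names:
            annotation = FEATURE_REGISTRY.get(name)
            if annotation is None:
                raise ValueError(f"{model_name}: feature '{name}' has no annotation")
            if annotation.status is Status.DEPRECATED:
                raise ValueError(
                    f"{model_name}: feature '{name}' is deprecated (owner: {annotation.owner})"
                )

    # Run as part of CI so that deprecating a signal immediately surfaces every
    # declared consumer, e.g.:
    # check_model_features("ctr_model_v2", ["user_age_bucket", "legacy_geo_score"])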

System-level spaghetti

  • Glue code
    • Using self-contained solutions often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages.
    • This glue code design pattern can be costly in the long term, as it tends to freeze a system to the peculiarities of a specific package.
    • While generic systems might make it possible to interchange optimization algorithms, it is quite often refactoring of the construction of the problem space which yields the most benefit to mature systems. The glue code pattern implicitly embeds this construction in supporting code instead of in principally designed components. As a result, the glue code pattern often makes experimentation with other machine learning approaches prohibitively expensive, resulting in an ongoing tax on innovation.
    • Glue code can be reduced by choosing to re-implement specific algorithms within the broader system architecture (a related containment sketch follows this list)
    • Problem-specific machine learning code can also be tweaked with problem-specific knowledge that is hard to support in general packages
    • Only a tiny fraction of the code in many machine learning systems is actually doing “machine learning”.
    • When we recognize that a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, reimplementation rather than reuse of a clumsy API looks like a much better strategy.
  • Pipeline jungles
    • pipeline jungles often appear in data preparation
    • can only be avoided by thinking holistically about data collection and feature extraction
  • Dead experimental codepaths
    • over time, these accumulated codepaths can create a growing debt. Maintaining backward compatibility with experimental codepaths is a burden for making more substantive changes.
    • Knight Capital’s system losing $465 million in 45 minutes apparently because of unexpected behavior from obsolete experimental codepaths [9].
    • periodically re-examine each experimental branch to see what can be ripped out. Very often only a small subset of the possible branches is actually used; many others may have been tested once and abandoned.
    • a redesign and a rewrite of some pieces may be needed periodically in order to move forward efficiently
    • As a real-world anecdote, in a recent cleanup effort of one important machine learning system at Google, it was found possible to rip out tens of thousands of lines of unused experimental codepaths.
  • Configuration debt
    • the wide range of configurable options (which features are used, how data is selected, algorithm-specific settings, verification methods) accumulates around a mature system; configuration is easy to treat as an afterthought, even though configuration mistakes can be costly
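
For the glue code item above, the paper’s suggestion is to re-implement specific algorithms within the broader system. A softer intermediate step is to at least confine all package-specific glue behind one narrow, problem-specific interface; the sketch below is my own hypothetical illustration of that containment idea (the class and method names are invented), not something taken from the paper.

    # Hypothetical sketch: keep package-specific glue behind a narrow interface so
    # the general-purpose library can later be swapped or re-implemented in-house.
    from typing import Protocol, Sequence

    class CtrModel(Protocol):
        """The only surface the rest of the system is allowed to depend on."""
        def fit(self, features: Sequence[Sequence[float]], labels: Sequence[int]) -> None: ...
        def predict_ctr(self, features: Sequence[Sequence[float]]) -> Sequence[float]: ...

    class SklearnCtrModel:
        """Adapter: all scikit-learn specific glue lives in this one class."""

        def __init__(self) -> None:
            from sklearn.linear_model import LogisticRegression
            self._clf = LogisticRegression(max_iter=1000)

        def fit(self, features, labels):
            self._clf.fit(features, labels)

        def predict_ctr(self, features):
            # scikit-learn returns one probability column per class; keep only P(click = 1).
            return self._clf.predict_proba(features)[:, 1]

    # Serving and experimentation code depends only on CtrModel, so replacing
    # SklearnCtrModel with an in-house implementation does not ripple outward.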

Conclusions

  • Pay it off!
  • It may be reasonable to take on moderate technical debt for the benefit of moving quickly in the short term, but this must be recognized and accounted for lest it quickly grow unmanageable.
  • Technical debt is an issue that both engineers and researchers need to be aware of. Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice.
  • Paying down technical debt is not always as exciting as proving a new theorem, but it is a critical part of consistently strong innovation.
  • Developing holistic, elegant solutions for complex machine learning systems is deeply rewarding work.

 
