Introduction to Multi-Armed Bandits (2019)

arxiv.org

142 points by Anon84 a day ago


rented_mule - a day ago

We employed bandits in a product I worked on. It was selecting which piece of content to show in a certain context, optimizing for clicks. It did a great job, but there were implications that I wish we understood from the start.

There was a constant stream of new content (i.e., arms for the bandits) to choose from. Instead of running manual experiments (e.g., A/B tests or other designs), the bandits would sample the new set of options and arrive at a new optimal mix much more quickly.

But we did want to run experiments with other things around the content that was managed by the bandits (e.g., UI flow, overall layout, other algorithmic things, etc.). It turns out bandits complicate these experiments significantly. Any changes to the context in which the bandits operate lead them to shift things more towards exploration to find a new optimal mix, hurting performance for some period of time.

We had a choice we could make here... treat all traffic, regardless of cohort, as a single universe that the bandits are managing (so they would optimize for the mix of cohorts as a whole). Or we could set up bandit stats for each cohort. If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence. But the optimal mix will likely look different for one cohort vs. another vs. all of them combined. So it's better for experiment validity to isolate the bandits for each cohort. Now small cohorts can take quite a while to converge before we can measure how well things work. All of this puts a real limit on iteration speed.
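
To make the per-cohort isolation concrete, the difference is roughly whether the bandit state is keyed by arm alone or by (cohort, arm). A rough sketch of the latter, assuming Thompson sampling and made-up names (not our actual code):

    import random
    from collections import defaultdict

    # Beta(1, 1) prior on click-through for each (cohort, arm) pair
    stats = defaultdict(lambda: [1, 1])  # [successes, failures]

    def choose_arm(cohort, arms):
        # Thompson sampling using only this cohort's observations
        def draw(arm):
            s, f = stats[(cohort, arm)]
            return random.betavariate(s, f)
        return max(arms, key=draw)

    def record_outcome(cohort, arm, clicked):
        stats[(cohort, arm)][0 if clicked else 1] += 1

With stats keyed only by arm, every cohort's traffic updates the same posteriors, which is what breaks the independence assumption behind A/B comparisons.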

Things also become very difficult to reason about because there is state in the bandit stats that is being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside and it can be very difficult.

Much (all?) of this comes from bandits being feedback loops - these same problems are present in other approaches where feedback loops are used (e.g., control theory based approaches). Feedback mechanisms are incredibly powerful, but they couple things together in ways that can be difficult to tease apart.

liampulles - 16 hours ago

There is a great video with Jim Manzi on the subject of applying experiments in a business/government context - I think people interested in the subject might enjoy it:

https://www.youtube.com/watch?v=sf0vb4yiZR4

Normal_gaussian - a day ago

It's very easy to implement toy multi-armed bandits, and even getting them into production code isn't so bad. It's the reporting, education, and adding features without breaking everything that gets very hard very quickly.
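
For what it's worth, the toy version really is only a handful of lines. An illustrative epsilon-greedy sketch (made-up interface, arms is a list of arm ids; not production code):

    import random

    def make_bandit(arms, epsilon=0.1):
        counts = {a: 0 for a in arms}
        totals = {a: 0.0 for a in arms}

        def choose():
            # explore a fixed fraction of the time, otherwise exploit
            if random.random() < epsilon:
                return random.choice(arms)
            # unseen arms score +inf so each gets tried at least once
            return max(arms, key=lambda a: totals[a] / counts[a] if counts[a] else float("inf"))

        def update(arm, reward):
            counts[arm] += 1
            totals[arm] += reward

        return choose, update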

jmward01 - a day ago

I used bandits to select the best vendor/model to use on a per-customer basis for a task that we had a reward signal for. It worked amazingly well, but the more important thing we gained was the ability to accurately estimate the improvement other features/changes had. Getting a hard number for 'feature x improved over previous baseline by y' is a game changer for a lot of product and development work. You only think feature x is a massive improvement until you see the numbers and realize that the spacing displayed to a user was 10x more valuable than the Really Big Feature you thought was going to make a difference.

codeulike - 17 hours ago

Does the name "Multi-Armed Bandits" have anything to do with "One Armed Bandits" - old style slot machine/fruit machines/gambling machines with a big lever that people would pull?

edit: ah ok yes it does: https://en.wikipedia.org/wiki/Multi-armed_bandit

Brystephor - a day ago

Hey, an area I actually work on and use in production! MAB in our case has shown improvements in our A/B metrics. My general takeaways for my experience with MAB:

* we have a single optimization goal (e.g. the metric to minimize or maximize). The hard part isn't defining an optimization goal. The hard part is identifying what tradeoff between different metrics stakeholders actually want. If you have goal A and goal B, where B is more aggressive than A, then getting agreement on which is better is hard. This is a people problem.

* MAB seems to be a good proof of concept that something can be optimized, but it isn't an "end game" optimization path.

* MAB for A/B testing is going to mess up your A/B data and make everything more difficult. You should have a treatment that uses the MAB algorithm and a treatment that doesn't.

All of the above is for non-contextual MAB. I am currently learning about different MAB algorithms, although each of them is pretty similar. The ones I've read about are all effectively linear/logistic regression, and the differences come from the exploration mechanism and how uncertainty is represented. Epsilon-greedy has no notion of uncertainty; exploration just happens a fixed percentage of the time. UCB is optimistic about uncertainty, and the amount of optimism controls exploration. Thompson sampling uses statistical (Beta) distributions to represent uncertainty; exploration happens less as confidence about a particular set of options increases.
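
As a rough illustration of how those selection rules differ in the non-contextual case (hypothetical per-arm counters, binary rewards):

    import math, random

    # arms: dict of arm -> {"pulls": int, "rewards": int}

    def epsilon_greedy(arms, epsilon=0.1):
        # no notion of uncertainty; explore a fixed fraction of the time
        if random.random() < epsilon:
            return random.choice(list(arms))
        return max(arms, key=lambda a: arms[a]["rewards"] / max(arms[a]["pulls"], 1))

    def ucb1(arms):
        # optimism in the face of uncertainty; the bonus term drives exploration
        total = sum(v["pulls"] for v in arms.values())
        def score(a):
            n = arms[a]["pulls"]
            if n == 0:
                return float("inf")
            return arms[a]["rewards"] / n + math.sqrt(2 * math.log(total) / n)
        return max(arms, key=score)

    def thompson(arms):
        # sample from a Beta posterior; narrower posteriors mean less exploration
        def draw(a):
            s = arms[a]["rewards"]
            f = arms[a]["pulls"] - s
            return random.betavariate(s + 1, f + 1)
        return max(arms, key=draw)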

Overall it's a fun area to work in that is quite different from your typical CRUD development, which is a nice change of pace.

praptak - 18 hours ago

Multi-Armed Bandits running on move trees (aka the UCT algorithm) was the invention that laid the foundation for computers cracking Go.
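
The core of UCT is just UCB1 applied at each node during tree descent. Roughly, as a sketch (assuming nodes carry visits, total_value, and children):

    import math

    def uct_select(node, c=1.41):
        # pick the child maximizing mean value plus an exploration bonus
        def score(child):
            if child.visits == 0:
                return float("inf")
            exploit = child.total_value / child.visits
            explore = c * math.sqrt(math.log(node.visits) / child.visits)
            return exploit + explore
        return max(node.children, key=score)

The rest of MCTS (expansion, playout, backpropagation) wraps around this selection step.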

esafak - a day ago

One way to address the https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...

BrenBarn - a day ago

Just from a philosophical perspective I think multi-armed bandits are a fascinating idea and a great technique for thinking about many kinds of decision processes.

cdelsolar - 13 hours ago

We use multi-armed bandits in Scrabble Monte Carlo sims to smartly pick the best-performing move.