Introduction to Multi-Armed Bandits (2019)

arxiv.org

142 points by Anon84 a day ago


rented_mule - a day ago

We employed bandits in a product I worked on. It was selecting which piece of content to show in a certain context, optimizing for clicks. It did a great job, but there were implications that I wish we understood from the start.

There was a constant stream of new content (i.e., arms for the bandits) to choose from. Instead of running manual experiments (e.g., A/B tests or other designs), the bandits would sample the new set of options and arrive at a new optimal mix much more quickly.

But we did want to run experiments with other things around the content that was managed by the bandits (e.g., UI flow, overall layout, other algorithmic things, etc.). It turns out bandits complicate these experiments significantly. Any changes to the context in which the bandits operate lead them to shift things more towards exploration to find a new optimal mix, hurting performance for some period of time.

We had a choice we could make here... treat all traffic, regardless of cohort, as a single universe that the bandits are managing (so they would optimize for the mix of cohorts as a whole). Or we could set up bandit stats for each cohort. If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence. But the optimal mix will likely look different for one cohort vs. another vs. all of them combined. So it's better for experiment validity to isolate the bandits for each cohort. Now small cohorts can take quite a while to converge before we can measure how well things work. All of this puts a real limit on iteration speed.
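
To make the per-cohort isolation concrete, the difference is roughly whether the bandit state is keyed by arm alone or by (cohort, arm). A rough sketch of the latter, assuming Thompson sampling and made-up names (not our actual code):

    import random
    from collections import defaultdict

    # Beta(1, 1) prior on click-through for each (cohort, arm) pair
    stats = defaultdict(lambda: [1, 1])  # [successes, failures]

    def choose_arm(cohort, arms):
        # Thompson sampling using only this cohort's observations
        def draw(arm):
            s, f = stats[(cohort, arm)]
            return random.betavariate(s, f)
        return max(arms, key=draw)

    def record_outcome(cohort, arm, clicked):
        stats[(cohort, arm)][0 if clicked else 1] += 1

With stats keyed only by arm, every cohort's traffic updates the same posteriors, which is what breaks the independence assumption behind A/B comparisons.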

Things also become very difficult to reason about because there is state in the bandit stats that is being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside and it can be very difficult.

Much (all?) of this comes from bandits being feedback loops - these same problems are present in other approaches where feedback loops are used (e.g., control theory based approaches). Feedback mechanisms are incredibly powerful, but they couple things together in ways that can be difficult to tease apart.

liampulles - 16 hours ago

There is a great video with Jim Manzi on the subject of applying experiments in a business/government context - I think people interested in the subject might enjoy it:

https://www.youtube.com/watch?v=sf0vb4yiZR4

Normal_gaussian - a day ago

It's very easy to implement toy multi-armed bandits, and even getting them into production code isn't so bad. It's the reporting, education, and adding features without breaking everything that gets very hard very quickly.
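
For what it's worth, the toy version really is only a handful of lines. An illustrative epsilon-greedy sketch (made-up interface, arms is a list of arm ids; not production code):

    import random

    def make_bandit(arms, epsilon=0.1):
        counts = {a: 0 for a in arms}
        totals = {a: 0.0 for a in arms}

        def choose():
            # explore a fixed fraction of the time, otherwise exploit
            if random.random() < epsilon:
                return random.choice(arms)
            # unseen arms score +inf so each gets tried at least once
            return max(arms, key=lambda a: totals[a] / counts[a] if counts[a] else float("inf"))

        def update(arm, reward):
            counts[arm] += 1
            totals[arm] += reward

        return choose, update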

jmward01 - a day ago

I used bandits to select the best vendor/model to use on a per-customer basis for a task that we had a reward signal for. It worked amazingly well, but the more important thing we gained was the ability to accurately estimate the improvement other features/changes had. Getting a hard number for 'feature x improved over previous baseline by y' is a game changer for a lot of product and development work. You only think feature x is a massive improvement until you see the numbers and realize that the spacing displayed to a user was 10x more valuable than the Really Big Feature you thought was going to make a difference.

codeulike - 17 hours ago

Does the name "Multi-Armed Bandits" have anything to do with "One Armed Bandits" - old style slot machine/fruit machines/gambling machines with a big lever that people would pull?

edit: ah ok yes it does: https://en.wikipedia.org/wiki/Multi-armed_bandit

Brystephor - a day ago

Hey, an area I actually work on and use in production! MAB in our case has shown improvements in our A/B metrics. My general takeaways for my experience with MAB:

* we have a single optimization goal (e.g. the metric to minimize or maximize). The hard part isn't defining an optimization goal. The hard part is identifying what tradeoff between different metrics stakeholders actually want. If you have goal A and goal B, where B is more aggressive than A, then getting agreement on which is better is hard. This is a people problem.

* MAB seems to be a good proof of concept that something can be optimized, but it isn't an "end game" optimization path.

* MAB for A/B testing is going to mess up your A/B data and make everything more difficult. You should have a treatment that uses the MAB algorithm and a treatment that doesn't.

All of the above is for non-contextual MAB. I am currently learning about different MAB algorithms, although each of them is pretty similar. The ones I've read about are all effectively linear/logistic regression, and the differences come from the exploration mechanism and how uncertainty is represented. Epsilon-greedy has no notion of uncertainty; exploration just happens a fixed percentage of the time. UCB is optimistic about uncertainty, and the amount of optimism controls exploration. Thompson sampling uses statistical (Beta) distributions to represent uncertainty; exploration happens less as confidence about a particular set of options increases.
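
As a rough illustration of how those selection rules differ in the non-contextual case (hypothetical per-arm counters, binary rewards):

    import math, random

    # arms: dict of arm -> {"pulls": int, "rewards": int}

    def epsilon_greedy(arms, epsilon=0.1):
        # no notion of uncertainty; explore a fixed fraction of the time
        if random.random() < epsilon:
            return random.choice(list(arms))
        return max(arms, key=lambda a: arms[a]["rewards"] / max(arms[a]["pulls"], 1))

    def ucb1(arms):
        # optimism in the face of uncertainty; the bonus term drives exploration
        total = sum(v["pulls"] for v in arms.values())
        def score(a):
            n = arms[a]["pulls"]
            if n == 0:
                return float("inf")
            return arms[a]["rewards"] / n + math.sqrt(2 * math.log(total) / n)
        return max(arms, key=score)

    def thompson(arms):
        # sample from a Beta posterior; narrower posteriors mean less exploration
        def draw(a):
            s = arms[a]["rewards"]
            f = arms[a]["pulls"] - s
            return random.betavariate(s + 1, f + 1)
        return max(arms, key=draw)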

Overall it's a fun area to work in that is quite different from your typical CRUD development, which is a nice change of pace.

praptak - 18 hours ago

Multi-Armed Bandits running on move trees (aka the UCT algorithm) was the invention that laid the foundation for computers cracking Go.
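
The core of UCT is just UCB1 applied at each node during tree descent. Roughly, as a sketch (assuming nodes carry visits, total_value, and children):

    import math

    def uct_select(node, c=1.41):
        # pick the child maximizing mean value plus an exploration bonus
        def score(child):
            if child.visits == 0:
                return float("inf")
            exploit = child.total_value / child.visits
            explore = c * math.sqrt(math.log(node.visits) / child.visits)
            return exploit + explore
        return max(node.children, key=score)

The rest of MCTS (expansion, playout, backpropagation) wraps around this selection step.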

esafak - a day ago

One way to address the https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...

BrenBarn - a day ago

Just from a philosophical perspective I think multi-armed bandits are a fascinating idea and a great technique for thinking about many kinds of decision processes.

cdelsolar - 13 hours ago

We use multi-armed bandits in Scrabble Monte Carlo sims to smartly pick the best-performing move.