What data scientists wish data engineers knew


As a Data Scientist, if you could pick five skills of yours to teach every programmer, what five would you pick?

Question asked by Vince Buscarello

Five is a pretty arbitrary number, and the question is more general than this, but the majority of what I teach engineers about data science is actually about running and interpreting experiments. So here are five things every engineer should understand about experiments.

How to size an experiment.

This is actually pretty easy to do, but data scientists always get asked to do it.

  1. Decide what error bars you want for your metrics.
  2. Compute the error bars on an A-A test of arbitrary size.
  3. Scale up or down using the fact that error bars shrink by a factor of n when you grow your sample by a factor of n² (error scales like 1/√N). You can do this sizing yourself and skip a round of back-and-forth with the data science team; a sketch follows this list.
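
As a concrete illustration of steps 2 and 3, here is a minimal Python sketch: it measures the 95% error bar on a simulated A/A test of some starting size, then scales the sample using the fact that error bars shrink like 1/√n. The baseline rate, starting size, and target width are made-up numbers for illustration, not a recommendation.

```python
import math
import random

def aa_half_width(n, baseline_rate, z=1.96, seed=0):
    """95% CI half-width for the treatment-minus-control difference in a
    simulated A/A test with n users per arm (illustrative only)."""
    rng = random.Random(seed)
    conversions_a = sum(rng.random() < baseline_rate for _ in range(n))
    conversions_b = sum(rng.random() < baseline_rate for _ in range(n))
    p_a, p_b = conversions_a / n, conversions_b / n
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return z * se

# Step 2: error bars on an A/A test of arbitrary size.
n0 = 10_000
observed = aa_half_width(n0, baseline_rate=0.05)

# Step 3: error bars scale like 1/sqrt(n), so shrinking them by a factor k
# takes k**2 times as many users.
target = 0.005  # want error bars no wider than +/- 0.5 percentage points
needed = math.ceil(n0 * (observed / target) ** 2)
print(f"A/A half-width at n={n0:,}: {observed:.4f}; need ~{needed:,} users per arm")
```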

What confidence intervals mean.

If you ask even a data scientist for the definition of a confidence interval, there’s a decent chance they’ll get it wrong, so I understand it’s a hard idea: confidence intervals don’t mean what you’d want them to. Lots of engineers (and other people) want to treat statistical significance as a line between “there’s definitely something there” and “there’s definitely nothing there,” but that’s not how things work, especially when you look at multiple metrics, multiple experiments, or multiple points in time. Confidence intervals are designed for looking at one metric you strongly believe could move, and looking at it exactly once. Anything else gets fuzzy, and it becomes too easy to make unsupported claims. Realistically, I don’t expect engineers (or anyone who isn’t specialized in analysis) to get all of this right, but more skepticism about error bars would be an improvement.
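
To make the multiple-metrics point concrete, here is a small simulation using only the standard library (the sample sizes and metric count are arbitrary assumptions): an A/A test with no real effect, read against ten independent metrics. Any single metric only crosses the 95% bar about 5% of the time, but the chance that *some* metric does is closer to 40%.

```python
import random
import statistics

def any_metric_significant(n_metrics, n_users, rng, z=1.96):
    """One simulated A/A test (no real effect) read against several
    independent metrics; returns True if any metric looks 'significant'."""
    for _ in range(n_metrics):
        treat = [rng.gauss(0, 1) for _ in range(n_users)]
        control = [rng.gauss(0, 1) for _ in range(n_users)]
        diff = statistics.mean(treat) - statistics.mean(control)
        se = (statistics.variance(treat) / n_users
              + statistics.variance(control) / n_users) ** 0.5
        if abs(diff) > z * se:
            return True
    return False

rng = random.Random(42)
trials = 1_000
hits = sum(any_metric_significant(10, 200, rng) for _ in range(trials))
print(f"False-positive rate with 10 metrics: {hits / trials:.0%}")  # roughly 40%, not 5%
```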

A single point of data is never “trending.”

I have no idea where this comes from, but I’ve heard it from multiple people at every place I’ve worked as a data scientist, so it must come from somewhere. As far as I can tell, non-data-scientists use “trending” to mean “this move isn’t statistically significant, but it’s in the direction we wanted, so we’re going to report it anyway.” You have no hope of getting repeatable results if that’s how you treat experiments. I know “trending” can have a legitimate meaning in experimental contexts, but at this point it’s safer to forbid the word altogether.
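
A quick way to see why “in the direction we wanted” carries no information on its own: under pure noise, the metric lands on the desired side about half the time. The sizes below are arbitrary, illustrative assumptions.

```python
import random
import statistics

# A/A test with no real effect: how often does the metric move in the
# direction we wanted?
rng = random.Random(7)
trials, n = 2_000, 500
positive = 0
for _ in range(trials):
    treat = statistics.mean(rng.gauss(0, 1) for _ in range(n))
    control = statistics.mean(rng.gauss(0, 1) for _ in range(n))
    if treat > control:
        positive += 1
print(f"'Trending up' under pure noise: {positive / trials:.0%}")  # ~50%
```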

How to set up exposure logging.

Every company I’ve worked at has had some concept of “the set of users actually impacted by an experiment, or who would have been had they not been in the control group.” There’s good reason to want this: if a user never reaches the feature you changed, their behavior can’t be any different, and adding them to your computations only introduces unnecessary noise. The problem is that marking users as “exposed” correctly turns out to be surprisingly hard. Almost every engineer who sets this up gets it wrong the first time, and their experiment is busted as a result.

A user must be marked exposed before any code runs differently between treatment and control (see the sketch after this list).
  • Yes, that includes if you pre-compute something but don’t show it immediately.
  • Yes, that includes if the user is just hitting a different server.
  • Yes, if they were exposed once they’re exposed forever.
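
Here is a minimal sketch of what “mark exposure before the code paths diverge” looks like in practice. Every name in it (get_variant, log_exposure, render_feed, the experiment name) is a hypothetical placeholder, not a real experimentation SDK.

```python
import hashlib

# Hypothetical stand-ins for an experimentation SDK; illustrative only.
EXPOSURE_LOG = []

def get_variant(user_id, experiment):
    """Deterministic 50/50 split on a hash of user + experiment."""
    h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h, 16) % 2 else "control"

def log_exposure(user_id, experiment, variant):
    EXPOSURE_LOG.append((user_id, experiment, variant))

def render_feed(user_id):
    variant = get_variant(user_id, "new_ranker")

    # Mark exposure *before* any code path diverges -- even if the result
    # is precomputed and never shown, and even if control users are about
    # to be routed to a different server.
    log_exposure(user_id, "new_ranker", variant)

    model = "v2" if variant == "treatment" else "v1"
    return f"feed for {user_id} ranked by {model}"

print(render_feed("user-123"), EXPOSURE_LOG)
```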

Tiny imbalances between treatment and control sizes can look a lot like a plausible metric move, so an eager engineer can end up shipping on bogus data without noticing. At a minimum, check that treatment and control have roughly the same number of exposed users; if they don’t, you’re likely in trouble. A quick version of that check is sketched below.
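
One cheap guard is a sample-ratio-mismatch check: compare the exposed counts against the split you intended. Here is a sketch using scipy’s chi-square test, with made-up counts for an intended 50/50 split.

```python
from scipy.stats import chisquare

# Hypothetical exposure counts for a 50/50 experiment. A tiny p-value means
# assignment or exposure logging is broken and the metric comparison
# should not be trusted.
treatment_n, control_n = 50_421, 49_258
total = treatment_n + control_n
stat, p_value = chisquare([treatment_n, control_n], f_exp=[total / 2, total / 2])
print(f"SRM chi-square p-value: {p_value:.4g}")  # here ~0.0002 -> investigate
```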

Write a complete experimental plan before you start running your experiment.

This plan should include what you expect to happen and why, which metrics you will measure, and when you will measure them. All of this puts you on the right track to good analysis. You can still deviate from the plan for good reason later; for example, you might see something unexpected you want to explore, or you might want to abort something terrible early. Having the plan in place, though, makes it clear that everything outside it carries a higher burden of proof. If you get mushy findings by running longer or looking at undeclared metrics, you should probably run a new version to see whether the results are repeatable.
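
For what it’s worth, here is a sketch of what such a plan might look like written down as data. The fields and values are my own illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentPlan:
    hypothesis: str
    primary_metric: str
    expected_direction: str               # declared before launch
    secondary_metrics: list = field(default_factory=list)
    analysis_window: str = ""             # when metrics will be read, decided up front
    min_sample_per_arm: int = 0           # from the sizing step above

plan = ExperimentPlan(
    hypothesis="New ranker increases items viewed per session",
    primary_metric="items_viewed_per_session",
    expected_direction="increase",
    secondary_metrics=["session_length", "7_day_retention"],
    analysis_window="two weeks after launch, one read",
    min_sample_per_arm=14_600,
)
print(plan)
```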

Source: https://www.quora.com/As-a-Data-Scientist-if-you-could-pick-five-skills-of-yours-to-teach-every-programmer-what-five-would-you-pick