Advice on exploratory work
Advice on “what to do within a structured task exploring a dataset”
Imagine that you are writing a guide on how to do exploratory data analysis in gnlab. What are the top three pieces of advice you would give to your prior self?
- Identify what the PI wants to know in the ticket (and ideally why). How does the output you are producing answer that question, and is it the best way to answer it?
- The first comment may well contain unfamiliar jargon. Be sure you understand all terms used in the first data exploration/benchmarking task for a dataset, especially if the dataset is new to gnlab (and even if it is not, this is a good check on your understanding and makes it easier for someone else to pick up the task). For example, initial claims in the UI system can be complicated. It can be useful to have a list of definitions written up in your own words stowed away somewhere. This is also useful when coming back to a task, when other RAs pick up the task, and during code review.
- Read the documentation.
- If it isn’t helpful, read papers that have used the data before.
- Don’t assume that a variable is defined in the way that seems intuitive to you: look through the documentation and at a few values of the actual data to make sure that your understanding of the data is consistent with what’s actually there (see the sketch below).
- Are broad patterns in the data consistent with your intuition?
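As a minimal sketch of these two checks (the data frame, variable names, and values below are all hypothetical), a quick tabulation shows how a variable is actually coded and whether a broad pattern looks the way you expect:

```r
library(dplyr)

# Hypothetical UI claims extract; names and values are placeholders.
claims <- tibble(
  claim_id   = 1:8,
  claim_type = c("initial", "initial", "continued", "continued",
                 "continued", "continued", "additional", NA),
  week       = c(1, 1, 1, 2, 2, 2, 2, 2)
)

# How is the variable actually coded? Look at the values rather than assuming.
claims %>% count(claim_type, sort = TRUE)

# Broad pattern check: does the mix of claim types by week match your intuition?
claims %>% count(week, claim_type)
```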
- Try to benchmark the data against other sources, even if the comparison is only approximate. For example, compare UI claims with the unemployment rate (see the sketch below).
- Also benchmark against work you've previously done (e.g., if you previously found a default rate of 5% but your current analysis implies a default rate of 10% in the same sample, something is inconsistent).
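One way to do this, sketched here with made-up objects and values (both `our_claims` and `benchmark` are hypothetical), is to aggregate your series to the benchmark's frequency, join the two, and check that they move together:

```r
library(dplyr)

# Hypothetical monthly claim counts from our data and a published benchmark
# series; names and values are placeholders for illustration.
our_claims <- tibble(
  month    = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  n_claims = c(1200, 1150, 5400)
)

benchmark <- tibble(
  month          = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  published_rate = c(3.6, 3.5, 4.4)
)

# Join and index both series to their first month. We only expect them to move
# together roughly, not to match level for level.
our_claims %>%
  inner_join(benchmark, by = "month") %>%
  mutate(
    claims_index = n_claims / first(n_claims),
    rate_index   = published_rate / first(published_rate)
  )
```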
- Set unit tests ahead of time for things you know must be true. This way you don’t spend time “exploring” data that is wrong in the first place.
- An example: say you want to plot checking account inflows over time. You balance the sample so that the composition of accounts isn't changing across time (or maybe you think that it's already balanced). Adding a testthat check to confirm that there are the same number of observations in each time period verifies that you did this correctly. Say you didn't check this and the sample was actually unbalanced, perhaps leading to rising inflows over time: you might then waste time trying to interpret this "pattern in the data" when really it was a coding bug from your data cleaning.
- Another example: we've (I've) had issues with implicitly dropping zeros when aggregating the Chase transaction-level data. Adding a testthat check that the minimum value of inflows (or whatever variable we're interested in) is 0 is an easy way to make sure that we catch mistakes like this before we start exploring and interpreting flawed data. Both checks are illustrated in the sketch below.
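A minimal sketch of both checks, assuming a hypothetical account-by-month data frame `inflows_monthly` with placeholder names and values:

```r
library(dplyr)
library(testthat)

# Hypothetical aggregated inflows; in practice this is the output of your
# cleaning code.
inflows_monthly <- tibble(
  account_id    = rep(1:3, each = 2),
  month         = rep(c("2020-01", "2020-02"), times = 3),
  total_inflows = c(1500, 0, 2200, 1800, 0, 900)
)

# Balanced panel: every month should have the same number of accounts, so
# changes in composition don't masquerade as trends in inflows.
test_that("panel is balanced across months", {
  obs_per_month <- inflows_monthly %>% count(month)
  expect_equal(n_distinct(obs_per_month$n), 1)
})

# Zeros were not implicitly dropped when aggregating: if zero-inflow months are
# retained, the minimum of the aggregated variable should be 0.
test_that("zero-inflow observations are retained", {
  expect_equal(min(inflows_monthly$total_inflows), 0)
})
```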
- Visualize intermediate steps. How does a series change after dropping observations? What changes do you see if you limit to certain regions/groups/income percentiles? Visualize missing data (the visdat package is very helpful here; see the sketch below).
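For example (assuming a hypothetical intermediate data frame `df`), visdat gives a quick visual summary of missingness and variable types:

```r
library(dplyr)
library(visdat)

# Hypothetical intermediate data frame; in practice, pass whatever object you
# just created in your cleaning pipeline.
df <- tibble(
  account_id = 1:5,
  inflows    = c(1200, NA, 800, 950, NA),
  region     = c("MW", "NE", NA, "S", "W")
)

vis_miss(df)  # share and pattern of missing values, by variable and row
vis_dat(df)   # variable types and missingness together
```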
- Does the output accord with your own intuition and sanity checks? If there is a result that is unexpected or doesn’t make sense, where do you think this result has come from? Have you checked that intermediate results also make sense? Depending on how long it takes to run down the confusion, it might make sense to do it straight away or to post a proposed path forward for a PI to approve before implementing.
- You are closer to the data than the PI. If you don’t share the bugs/problems in the data, then other PIs and RAs will never know about them. In the process of producing the output, did you find anything, or are you suspicious of anything, that the PI didn't ask for but might want to know now? Examples are things like unexpected outliers, unexpected missing data, distributions with unexpected shapes, and so on. If there is a reason to be skeptical of the results, have you communicated this?
- If you’re sharing exploratory work with someone else, be explicit about what you’re showing. This goes for plot labelling as well as descriptions in ticket comments. Be explicit about choices you made along the way (e.g., “I filtered on x for y reason”) so that the people you’re sharing things with (PIs or RAs) are on the same page as you.
- Share intuition/analysis of what you’ve just found in words: What does this plot/statistic/thing tell us?
- Does the output alone provide a clear answer to the original prompt?
- For example, if you can’t see the variation on a bar plot because of an outlier, and the PI is interested in the variation, then it makes sense to produce a second version without the outlier; see the sketch below. (If the PI is interested in how much of an outlier a particular observation is, then maybe it doesn’t make so much sense.)
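One possible pattern (the data frame, values, and the cutoff used to define the outlier are all made up for illustration) is to post the full-sample plot alongside a clearly labelled trimmed version:

```r
library(dplyr)
library(ggplot2)

# Hypothetical group-level totals with one extreme outlier.
totals <- tibble(
  group = c("A", "B", "C", "D", "E"),
  value = c(120, 95, 110, 130, 5000)
)

# Version 1: full sample. The outlier compresses the variation among A-D.
p_full <- ggplot(totals, aes(x = group, y = value)) +
  geom_col() +
  labs(title = "All groups")

# Version 2: drop the outlier and say so explicitly in the title and in the
# ticket comment, so readers know what they are looking at.
p_trimmed <- totals %>%
  filter(value < 1000) %>%
  ggplot(aes(x = group, y = value)) +
  geom_col() +
  labs(title = "Excluding group E (value > 1,000)")
```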
- If applicable, consider and propose next steps that you think would help to address the original prompt of the ticket.
- Example message: “If left to my own devices, this is what I would do next.”
- Before you post output, spend up to a set number of minutes implementing these next steps and incorporating what you’ve learned into the write-up.
I think what is most difficult about these types of structured tasks (still, but particularly as a new RA) is that it’s frequently hard to tell how much PIs want done beyond the literal wording of the ticket. For example, it’s hard to know whether it is worth spending, say, 30 minutes on a graph/table that might be the next line of questioning in order to save time, and what about an hour, or two hours? Is it worth spending 5 minutes to produce a graph that makes a point you want to make clearer, and how about an hour or a day? Is it worth spending an hour thinking and writing about the results when what is going on will likely be obvious to the PI, and how about three? If a task is turning into a rabbit hole, when should you stop? If the goal of codifying this is to improve the quality of output you are receiving from new RAs, I think it would also be helpful (or potentially more helpful) to give guidance on these types of principles.
I totally agree with Pete's last point on determining how much "extra effort" is appropriate on tickets. This is something I still struggle with and was especially hard as a new RA. I'm not sure what the right way of giving guidance on this is, because it probably tends to be ticket-specific, but any rules of thumb for managing the trade-offs between posting output in a timely manner and doing work that you think is relevant but not explicitly asked for would be helpful for new RAs.
I think Pete's last paragraph (about judging what should be done in addition, and how long should be spent on outputs or on writing up an explanation that might be extraneous) is still relevant to me now, and guidance on this would be helpful. For example, if someone is unsure whether they should spend more time on something, a compromise might be to detail in a GitHub comment what the next steps are and how long they think these might take, so that a PI can review this.
- if left to my own devices, this is what I would do next
- It is especially hard when first working in the lab to know how much to do.
- Work that (a) is not assigned, and (b) seems quite useful for addressing the objective: even if you know it is worth it in expectation, if you are not comfortable justifying the work you are doing, you might not do it (the “confidence issue”).
“Before you post output, spend up to a set number of minutes producing an additional piece of output that you think helps answer the question in the ticket” as a checkbox. I think this resolves both the time issue (since you’ve budgeted the time for them) and the confidence issue (since you’ve asked for it), which might help encourage people to go beyond the literal word of the ticket. I think PIs’ responses to these types of additional output can then help gauge, in the future, what kinds of additional output you find helpful or not helpful and the amount of time that you expect to be spent on it…