Introduction - leondutoit/data-centric-programming GitHub Wiki
Who is this for? To a large extent this wiki is written for someone like me a while ago: someone interested in doing empirical work with quantitative data in a programmatic way; someone who knows what they would like to do, but who does not have years of experience as a programmer or computer scientist; someone who has used the high-level packages to do statistical modelling and data analysis but who is looking for more control over data at a basic level; someone who wants to explore new data sources on the web and publish their insights online. To use a more fashionable term: this wiki will help you become a practical data scientist.
Why make this at all? Firstly, the traditional data analysis course or tutorial always starts with conveniently formatted data and ends with pretty graphs and tables displayed on the analyst's screen. In reality, this is likely to be less than 50% of the work needed to have an automated data analysis produce insights that many people can see. The difference between writing and running a sophisticated, insight-producing data-centric program on your own machine and deploying that same program to a production environment can be huge. Secondly, traditional courses do not spend enough time helping people develop good software engineering practices around their data analysis workflows.
Many small things have to be done differently to make sustainable contributions to a production code base. There is also a huge amount of practical work to do, and many tools to master and understand, around the actual data analysis. It is hard to find resources on what these things are, how they work and why they exist.
What kind of technology focus does this wiki have? The code snippets, explanations and links in this wiki are designed to help people get their data insights 'out there' with less friction. Technology is discussed to help people master all the tools that are needed to do that in a sustainable way. The primary focus is on open source tools for data management, analysis and visualisation in Python, R and d3.