Home - lvphj/epydemiology GitHub Wiki

Welcome to the ePydemiology wiki!

The ePydemiology package is a library of Python functions originally written for two purposes: to simplify some commonly encountered data-wrangling and analysis tasks at SAVSNET (the Small Animal Veterinary Surveillance Network), and to accompany the teaching of basic epidemiology for veterinary undergraduate research projects, where the output of the functions needed to match the teaching materials. The functions rely heavily on Pandas dataframes for handling data. If the functions are useful, please feel free to use them.

The following functions are available:

Loading and retrieving data

1. Load data from a named cell range in an Excel workbook

myDF = epy.phjReadDataFromExcelNamedCellRange()

2. Connect to databases

myConn = epy.phjConnectToDatabase()

3. Load data from a MySQL or SQL Server database

myDF = epy.phjGetDataFromDatabase()

Miscellaneous functions

4. Load text from a text file or argument into a Python string variable

myString = epy.phjGetStrFromArgOrFile()

5. Load text from a text file (e.g. a SQL query or regular expression) into a Python string variable

myString = epy.phjReadTextFromFile()

6. Create a named group regex from individual regexes

myRegexStr = epy.phjCreateNamedGroupRegex()

or, if phjRegexPreCompile is set to True:

myCompiledRegexObj = epy.phjCreateNamedGroupRegex()

7. Find regular expression named-group matches in a dataframe column

myDF = epy.phjFindRegexNamedGroups()

8. Identify the maximum level of taxonomic detail in a classification

myDF = epy.phjMaxLevelOfTaxonomicDetail()

9. Reverse map a categorical variable based on dictionary values

myDF = epy.phjReverseMap()

10. Retrieve unique values from dataframes

myDF = epy.phjRetrieveUniqueFromMultiDataFrames()

11. Update dataframe with new values

myDF = epy.phjUpdateLUT()

12. Update LUT to latest values

myDF = epy.phjUpdateLUTToLatestValues()
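
The idea behind items 6 and 7 (joining several individual regexes into one regex of named groups, then finding which group matched) can be sketched with the standard `re` module. This is only an illustration under stated assumptions: the helper `build_named_group_regex` and the example patterns are hypothetical, and the sketch does not reproduce the arguments or behaviour of `phjCreateNamedGroupRegex()` or `phjFindRegexNamedGroups()`.

```python
import re

# Hypothetical example patterns; the names and regexes are illustrative only.
patterns = {
    "year": r"\d{4}",
    "word": r"[A-Za-z]+",
}


def build_named_group_regex(named_patterns):
    """Join individual regexes into a single alternation of named groups."""
    return "|".join(
        f"(?P<{name}>{pattern})" for name, pattern in named_patterns.items()
    )


combined = build_named_group_regex(patterns)  # plain regex string
compiled = re.compile(combined)               # pre-compiled alternative

# match.lastgroup holds the name of the group that matched.
match = compiled.fullmatch("2019")
matched_group = match.lastgroup
```

Returning either the string or the compiled object mirrors the two return forms listed for item 6; which to prefer depends on whether the regex will be reused many times.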

Matrix functions

13. Convert columns of binary data to a square matrix containing co-occurrences

myArr = epy.phjBinaryVarsToSquareMatrix()

14. Convert a long dataframe to wide format containing binary variables

myDF = epy.phjLongToWideBinary()
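
To show what a square co-occurrence matrix built from binary columns looks like (item 13), here is a minimal pure-Python sketch. The data and the helper `cooccurrence_matrix` are hypothetical, and the layout chosen here (counts of each variable on the diagonal) is an assumption rather than a description of what `phjBinaryVarsToSquareMatrix()` returns.

```python
# Hypothetical binary data: each row is one record, each column one variable.
columns = ["a", "b", "c"]
rows = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]


def cooccurrence_matrix(records):
    """Count, for every pair of variables, the records where both equal 1."""
    n = len(records[0])
    matrix = [[0] * n for _ in range(n)]
    for record in records:
        for i in range(n):
            for j in range(n):
                # The product is 1 only when both variables are present.
                matrix[i][j] += record[i] * record[j]
    return matrix


matrix = cooccurrence_matrix(rows)
```

With this layout the matrix is symmetric, and each diagonal cell is simply the number of records in which that variable is 1.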

Plotting proportions

15. Calculate and plot a series of binomial proportions

myDF = epy.phjCalculateBinomialProportions()

16. Calculate and plot multinomial proportions

myDF = epy.phjCalculateMultinomialProportions()

17. Calculate binomial confidence intervals in summary table

myDF = epy.phjCalculateBinomialConfInts()

18. Convert a disease summary table to a dataframe of binary outcomes

myDF = epy.phjSummaryTableToBinaryOutcomes()

19. Plot annual disease trend

myDF = epy.phjAnnualDiseaseTrend()
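
As background to the binomial confidence intervals mentioned in items 15 and 17, the sketch below computes a Wilson score interval for a single proportion. The choice of the Wilson method is an assumption made for illustration; this page does not state which method the package functions use, and `wilson_interval` is a hypothetical helper, not part of ePydemiology.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return (lower, upper) Wilson score limits for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical example: 50 positive animals out of 100 tested.
lower, upper = wilson_interval(50, 100)
```

Unlike the simple normal approximation, the Wilson interval stays within 0 and 1 even when the observed proportion is close to either limit.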

Postcode-related functions

20. Clean UK postcodes

myDF = epy.phjCleanUKPostcodeVariable()

21. Add a postcode variable formatted to 7 characters

myDF = epy.phjPostcodeFormat7()

22. Extract Ordnance Survey grid reference from a column of free text and convert to latitude and longitude

myDF = epy.phjConvertOSGridRefToLatLong()

23. Convert placename conjunctions to lowercase

myDF = epy.phjSetPlacenameConjunctionsToLower()
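
For item 21, the sketch below shows one common convention for a 7-character UK postcode: the outward code is left-aligned in the first four characters and the three-character inward code fills the last three. That convention is an assumption here, and `postcode_to_7_chars` is a hypothetical helper rather than the implementation of `phjPostcodeFormat7()`. The sketch does no validation beyond a length check.

```python
def postcode_to_7_chars(postcode):
    """Format a UK postcode as 7 characters (outward padded to 4, inward 3)."""
    compact = "".join(postcode.split()).upper()  # drop all whitespace
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {postcode!r}")
    outward, inward = compact[:-3], compact[-3:]
    return outward.ljust(4) + inward


# Hypothetical inputs with different outward-code lengths.
examples = [postcode_to_7_chars(pc) for pc in ("l1 1aa", "M60 1NW", "SW1A 1AA")]
```

A fixed-width form like this is mainly useful for joining against lookup tables that store postcodes in the same 7-character layout.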

Select data

24. Generate a matched or unmatched case-control dataset

myDF = epy.phjGenerateCaseControlDataset()

25. Select matched or unmatched case-control data

myDF = epy.phjSelectCaseControlDataset()

26. Collapse a dataframe based on patient ID variable

myDF = epy.phjCollapseOnPatientID()

Clean data

27. Parse date variables

myDF = epy.phjParseDateVar()

28. Convert a UK (day first) date string to consistent format

myDF = epy.phjUKDateStrToDatetime()

29. Strip white space from strings in object columns of a Pandas dataframe

myDF = epy.phjStripWhiteSpc()

30. Extract the minimum repeating string from a string variable

myDF = epy.phjAddColumnOfMinRepeatingString()

31. Aggregate duplicate columns and rows in a Pandas dataframe

myDF = epy.phjAggDupColsAndRows()

32. Convert a Pandas dataframe from wide to long format

myDF = epy.phjWide2Long()
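
To make "minimum repeating string" (item 30) concrete, the sketch below finds the shortest unit that, repeated a whole number of times, rebuilds the input. It uses the well-known trick of searching for the string inside its own doubling; the helper `min_repeating_unit` is hypothetical, and whether `phjAddColumnOfMinRepeatingString()` works this way is not stated on this page.

```python
def min_repeating_unit(text):
    """Return the shortest substring whose repetition reproduces `text`."""
    if not text:
        return text
    # The first place text reappears in text+text (after position 0)
    # is the length of its smallest repeating unit.
    period = (text + text).find(text, 1)
    return text[:period]


# Hypothetical values, e.g. an entry that was accidentally duplicated on input.
units = [min_repeating_unit(s) for s in ("abcabcabc", "abab", "aaaa", "abcd")]
```

A string with no internal repetition is returned unchanged, so the function is safe to apply to a whole column.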

Explore data

33. View a plot of log odds against mid-points of categories of a continuous variable

myOddsRatioTable = epy.phjViewLogOdds()

34. Categorise a continuous variable using predefined breaks, quantiles or optimised break positions

myDF = epy.phjCategoriseContinuousVariable()

or, if phjReturnBreaks is set to True:

myDF,myBreaks = epy.phjCategoriseContinuousVariable()
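
Categorising a continuous variable with predefined breaks (item 34) amounts to locating each value between two adjacent break points. The sketch below does this with the standard `bisect` module. The breaks, the helper `categorise` and the interval convention (left-closed intervals, with the final interval also including the top break) are all assumptions for illustration, not the behaviour of `phjCategoriseContinuousVariable()`.

```python
from bisect import bisect_right

# Hypothetical break points, e.g. for an age variable.
breaks = [0, 10, 20, 30]


def categorise(value, break_points):
    """Return the (lower, upper) interval containing value, or None."""
    if value < break_points[0] or value > break_points[-1]:
        return None  # outside the range covered by the breaks
    # Cap the index so the top break falls in the final interval.
    idx = min(bisect_right(break_points, value), len(break_points) - 1) - 1
    return (break_points[idx], break_points[idx + 1])


categories = [categorise(v, breaks) for v in (0, 5, 10, 30, 31)]
```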

Epidemiology-related functions

35. Calculate odds and odds ratios for case-control studies

myDF = epy.phjOddsRatio()

36. Calculate relative risks for cross-sectional or longitudinal studies

myDF = epy.phjRelativeRisk()
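
For reference, the quantities named in items 35 and 36 can be computed from a single 2x2 table as sketched below. The table layout (a = exposed with disease, b = exposed without, c = unexposed with disease, d = unexposed without), the Woolf (log-based) confidence interval and the helper names are assumptions for illustration; they do not describe the arguments or output of `phjOddsRatio()` or `phjRelativeRisk()`.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Woolf (log-based) confidence interval.

    a, b: exposed with / without disease
    c, d: unexposed with / without disease
    All four cells must be non-zero.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


def relative_risk(a, b, c, d):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed are diseased.
or_estimate, or_lower, or_upper = odds_ratio(20, 80, 10, 90)
rr_estimate = relative_risk(20, 80, 10, 90)
```

The odds ratio suits case-control designs, where risks cannot be estimated directly, while the relative risk needs the denominators available in cross-sectional or longitudinal studies.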