Data formats - Peder2911/Diverse_Folio_Isle GitHub Wiki

Sources

Diverse Folio Isle reads data from multiple (diverse) sources. The following are currently supported:

.csv

The program reads and manipulates .csv files of data. This is particularly convenient when processing the same data several times, because the program also writes its output as .csv files.

SQLite database

Database support allows you to use large amounts of cached data in analyses.
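As a rough illustration of how cached rows could be pulled out of SQLite into Python, here is a minimal sketch using the standard-library sqlite3 module. The table name "articles" and its schema are assumptions made for the example (this page does not document the actual database layout), and the in-memory database stands in for a real cache file.

```python
import sqlite3

# Assumption: a single table named "articles" holding the five standard columns.
# The real database layout is not documented on this page.
conn = sqlite3.connect(":memory:")  # in-memory database, stands in for a cache file
conn.row_factory = sqlite3.Row      # makes each row addressable by column name

conn.execute(
    "CREATE TABLE articles (headline TEXT, body TEXT, source TEXT, date TEXT, id TEXT)"
)
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?, ?, ?)",
    ("Example headline", "Example body", "https://example.com/a", "2018-01-01", "q1"),
)
conn.commit()

# Convert every row into a plain dictionary keyed by column name.
rows = [dict(row) for row in conn.execute("SELECT * FROM articles")]
conn.close()
```

Each entry in rows is then an ordinary dictionary, so it can be handled like any other tabular record in Python.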

Montanus query

The main data gathering facility is the Montanus scraper, which allows you to gather data from one of Montanus' supported sources.

Runtime formats

During runtime, the data is reformatted several times to support different operations and to support both Python and R. The data always conforms to the standard format, which has these columns:

( headline | body | source | date | id )

  • Headline is the article headline
  • Body is the document body
  • Source is a link pointing to the document source
  • Date is the given publication date of the article
  • ID is an identifier used to connect data to a specific Montanus query
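To make the standard format concrete, the sketch below checks whether a piece of .csv text carries all five columns. It is only an illustration: the names STANDARD_COLUMNS and missing_columns are hypothetical and are not taken from the project's code, and the lowercase column names are assumed from the listing above.

```python
import csv
import io

# Column names assumed from the standard format listed above.
STANDARD_COLUMNS = ("headline", "body", "source", "date", "id")


def missing_columns(csv_text):
    """Return the standard columns that are absent from the .csv header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []  # fieldnames is None for empty input
    return [name for name in STANDARD_COLUMNS if name not in present]


good = (
    "headline,body,source,date,id\n"
    "A title,Some text,https://example.com/a,2018-01-01,q1\n"
)
bad = "headline,body\nA title,Some text\n"

print(missing_columns(good))
print(missing_columns(bad))
```

An empty result means the input conforms to the standard format; otherwise the list names the columns that still need to be supplied.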

List-of-dictionaries Python data frame

This is the format produced and used by several Python scripts to manipulate the data. It supports direct JSON serialization. The format mimics a conventional table (e.g. tab[1,'hello']), with list entries as rows and dictionary keys as columns, so that data[0]['col'] ≈ data[0,'col']. Scripts that use this format include:

  • Montanus : Writes data in this format as a JSON string
  • dsFunctions.patternSearch : Uses this format to perform pattern searching on the data
  • dsFunctions.validateCsv : Checks .csv input and makes sure all required columns are present
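A small sketch of what such a list-of-dictionaries table might look like, and how it serializes to JSON with the standard library. The records are made-up placeholders, not real query results.

```python
import json

# Each list entry is a row; each dictionary key is a column.
data = [
    {
        "headline": "First headline",
        "body": "First body",
        "source": "https://example.com/1",
        "date": "2018-01-01",
        "id": "q1",
    },
    {
        "headline": "Second headline",
        "body": "Second body",
        "source": "https://example.com/2",
        "date": "2018-01-02",
        "id": "q1",
    },
]

# Row 0, column "headline": the equivalent of data[0, 'headline'] in a table.
first_headline = data[0]["headline"]

# A whole column is a list comprehension over the rows.
headlines = [row["headline"] for row in data]

# The structure serializes to a JSON string and back without extra work.
payload = json.dumps(data)
restored = json.loads(payload)
```

Because the table consists only of lists, dictionaries and strings, the round trip through JSON returns a structure equal to the original.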

The function dsFunctions.stringToStdFormat converts .csv strings to this data format. Data in this format is passed through pipes as a .json string.
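The conversion from a .csv string to a list of dictionaries can be sketched with the standard csv module. This is not the implementation of dsFunctions.stringToStdFormat, whose details are not shown here; csv_string_to_records is a hypothetical name used only for illustration.

```python
import csv
import io
import json


def csv_string_to_records(csv_text):
    """Parse .csv text into a list of dictionaries, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


csv_text = (
    "headline,body,source,date,id\n"
    'A title,"Text, with a comma",https://example.com/a,2018-01-01,q1\n'
)

records = csv_string_to_records(csv_text)

# The JSON string is what would be written to a pipe for the next script.
json_text = json.dumps(records)
print(json_text)
```

Using csv.DictReader rather than splitting on commas keeps quoted fields intact, which matters for article bodies that contain commas.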

The R script modules/Pattern/jsonToCsv.r converts a .json string into a .csv string.
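For readers working on the Python side, the same transformation can be sketched in Python. This is only an analogue written for illustration, not a port of the R script; json_string_to_csv_string and its columns parameter are hypothetical names, and the column order is assumed to follow the standard format.

```python
import csv
import io
import json


def json_string_to_csv_string(
    json_text, columns=("headline", "body", "source", "date", "id")
):
    """Turn a JSON list of row dictionaries into .csv text with a header row."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # raises ValueError if a row has an unknown key
    return buffer.getvalue()


json_text = json.dumps(
    [
        {
            "headline": "A title",
            "body": "Some text",
            "source": "https://example.com/a",
            "date": "2018-01-01",
            "id": "q1",
        }
    ]
)

csv_out = json_string_to_csv_string(json_text)
print(csv_out)
```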