cache specification
Contents

1. Purpose
2. Specification
   2.1 Directory structure, file names, file time ranges
   2.2 Index database
3. hapi-cache.jar
4. Meeting Notes
5. Other Notes
6. Nobes Notes
- Develop a cache db specification. Given two clients that implement the spec, their caches can be shared. Database specification details include directory structure, file names, file time ranges, and an index db (so software does not need to do a recursive directory listing to determine whether the data it needs is available).
- Develop a program (`hapi-cache.jar`) that implements the specification; we expect many HAPI clients to use it on the back-end (clients may or may not expose all of the command-line arguments to `hapi-cache.jar`).
`HAPI_DATA` should be the environment variable indicating the HAPI cache directory. If it is not specified, the logic of Python's `tempfile` module should be used to get the system temporary directory, to which `hapi_data` should be appended; e.g., `/tmp/hapi_data` will be a common default.
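A minimal sketch of this lookup in Python (the function name is illustrative, not part of the specification):

```python
import os
import tempfile

def hapi_cache_dir():
    """Return HAPI_DATA if set; otherwise <system temp dir>/hapi_data."""
    # tempfile.gettempdir() implements the temp-directory lookup referenced
    # above, so on most Linux systems the fallback is /tmp/hapi_data.
    return os.environ.get("HAPI_DATA") or os.path.join(tempfile.gettempdir(), "hapi_data")
```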
Data directory naming: If `cadence` is given:

- `cadence < PT1S` - files should contain 1 hour of data and be in subdirectory `DATASET_ID/$Y/$m/$d/`. File names should be `$Y$m$dT$H.VARIABLE.EXT`.
- `PT1S <= cadence <= PT1H` - files should contain 1 day of data and be in subdirectory `DATASET_ID/$Y/$m/`. File names should be `$Y$m$d.VARIABLE.EXT`.
- `cadence > PT1H` - files should contain 1 month of data and be in subdirectory `DATASET_ID/$Y/`. File names should be `$Y$m.VARIABLE.EXT`.
If `cadence` is not given, the caching software should (use the rule ... always do daily (Jeremy)? Or more well defined (Nobes)?) and choose the appropriate directory structure. Likewise, software using the cache should assume that other software may have different logic and should check all resolutions.
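A sketch of the naming rule as a lookup (the helper is an assumption; the fallback for a missing cadence follows the "always do daily" suggestion above and is not settled):

```python
from datetime import timedelta

def layout_for(cadence):
    """Map a dataset cadence to (subdirectory template, file name template)."""
    if cadence is None:
        # cadence not given: open question above; "always do daily" assumed here
        return "DATASET_ID/$Y/$m/", "$Y$m$d.VARIABLE.EXT"
    if cadence < timedelta(seconds=1):      # cadence < PT1S: hourly files
        return "DATASET_ID/$Y/$m/$d/", "$Y$m$dT$H.VARIABLE.EXT"
    if cadence <= timedelta(hours=1):       # PT1S <= cadence <= PT1H: daily files
        return "DATASET_ID/$Y/$m/", "$Y$m$d.VARIABLE.EXT"
    return "DATASET_ID/$Y/", "$Y$m.VARIABLE.EXT"  # cadence > PT1H: monthly files
```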
Files should contain only data for the parameter, e.g., `19991201.Time.csv` will contain a single column with just the timestamps that are common to all parameters in the dataset. The file `19991201.Parameter1.csv` would not contain timestamps. If a user requests `Parameter1`, a program reading the cache will need to read two files, the `Time` file and the `Parameter1` file, to return the required data for `Parameter1`.
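A sketch (stdlib only; the helper name is illustrative) of reassembling one parameter by zipping the shared `Time` file with the parameter file:

```python
import csv

def read_parameter(day_dir, day, parameter):
    """Join the shared Time column with one parameter's columns."""
    with open(f"{day_dir}/{day}.Time.csv", newline="") as tf, \
         open(f"{day_dir}/{day}.{parameter}.csv", newline="") as pf:
        times = [row[0] for row in csv.reader(tf)]
        values = list(csv.reader(pf))
    # One output row per timestamp: [time, value1, value2, ...]
    return [[t, *v] for t, v in zip(times, values)]
```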
Directory structure for `PT1S <= cadence <= PT1H`:
```
hapi_data/
  # http://hapi-server.org/servers/SSCWeb/hapi
  http/
    hapi-server.org/
      servers/
        SSCWeb/
          hapi/
            capabilities.json
            capabilities.json.httpheaders
            catalog.json
            catalog.json.httpheaders
            data/
            info/
  # https://cdaweb.gsfc.nasa.gov/hapi
  https/
    cdaweb.gsfc.nasa.gov/
      hapi/
        capabilities.json
        capabilities.json.httpheaders
        catalog.json
        catalog.json.httpheaders
        data/
          A1_K0_MPA/2008/01/
            20080103.csv{.gz}             # All parameters
            20080103.csv{.gz}.httpheaders # All parameters
            20080103.binary{.gz}          # All parameters
            20080103.Time.csv{.gz}        # Single column
            20080103.Time.binary{.gz}
            20080103.sc_pot.csv{.gz}      # Single column
            20080103.sc_pot.binary{.gz}
            ...
          AC_AT_DEF/2009/02/
            ...
        info/
          A1_K0_MPA.json
          AC_AT_DEF.json
          ...
```
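A sketch of how a server URL and dataset map into this layout (the helper is an assumption, not part of the spec):

```python
from urllib.parse import urlparse

def data_dir(cache_root, server_url, dataset):
    """E.g., ("/tmp/hapi_data", "https://cdaweb.gsfc.nasa.gov/hapi", "A1_K0_MPA")
    -> /tmp/hapi_data/https/cdaweb.gsfc.nasa.gov/hapi/data/A1_K0_MPA"""
    u = urlparse(server_url)
    # The scheme (http/https) becomes the top-level directory, followed by
    # the host and server path, then data/<dataset> as in the tree above.
    return "/".join([cache_root, u.scheme, u.netloc + u.path, "data", dataset])
```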
Usage
```
java -jar hapi-cache.jar --url "https://server/hapi/catalog"
java -jar hapi-cache.jar --url "https://server/hapi/info?dataset=..."
java -jar hapi-cache.jar --url "https://server/hapi/data?dataset=...&parameters=...&start=...&stop=...&format={csv,bin}"
java -jar hapi-cache.jar \
  --server "https://server/hapi" --dataset=... --parameters=... --start=... --stop=... --format={csv,bin}
```
We also need to allow `https://server/hapi/info?dataset=...`

Response is CSV or binary according to `format`. The default behavior when used as a client is to use HTTP headers plus the existing cache to decide how to return data (use the cache or make a new request). When used as a server, the default is to use file timestamps (or HTTP headers from the back-end server if used in pass-through mode).
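The client-side decision might look like the following sketch, which reads the stored `.httpheaders` file from the layout above (header parsing is simplified and the helper name is illustrative):

```python
import email.utils
import time

def cache_is_fresh(path):
    """True if the stored .httpheaders file for <path> has an unexpired Expires header."""
    try:
        with open(path + ".httpheaders") as f:
            headers = dict(line.split(":", 1) for line in f if ":" in line)
        expires = email.utils.parsedate_to_datetime(headers["Expires"].strip())
        return expires.timestamp() > time.time()
    except (OSError, KeyError, ValueError):
        return False  # no cached headers or an unparsable date: make a new request
```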
Other options:

- `--cache-dir DIR`
- `--write-cache [T] or F` (write cache if not there)
- `--use-cache [T] or F` (use cache if there)
- `--dry-run` (if given, report the actions that would be taken)
- `--cache-exact` (only cache the exact request; will lead to fewer cache hits, but a fast cache response if the exact request is made again)
- `--use-expired-if-error` (if any existing file in the db is expired and the attempt to update it failed, use the stale cache instead of returning an error)
- `--expire-after N{y,d,h,m,s}` (use this word? Don't use the cache if it was written more than N{y,d,h,m,s} ago; this is a feature of Python's `requests_cache` lib)
1. Allow only `--url`?
2. Allow `format=json`?
Issues:
- Should metadata (HTTP headers) be cached as well?
- Should the scientist be able to lock the cache so that updates will not occur?
- Thread safety. As we develop, continue to ask if this can be added later without complication. Possible locking mechanism: https://stackoverflow.com/questions/11787567/cache-locking-for-lots-of-processes
Options to consider:
- Options at https://requests-cache.readthedocs.io/en/stable/modules/requests_cache.session.html#requests_cache.session.CachedSession
- `java -jar hapi-cache.jar --url "..." --average P1D` - create a block average and store it. (Probably don't try to implement this in `hapi-cache.jar`; have something else do it.)
Suggestion:

Come up with a minimal spec and implementation for `hapi-cache.jar`. I suggest it is whatever is needed to support `/info` and `/catalog` requests (`--exact` is implied for such requests).
Bob:
I think that we should also consider splitting the library problem into two parts:
1. Caching of metadata - Many libraries implement the HTTP RFCs already. A client requests some metadata through a hapi-cache proxy server or library, and the RFCs are followed. So hapi-cache does not necessarily need to support the RFCs, because HAPI client libraries could use existing libraries for this functionality. However, it may be convenient to support the RFCs. I think we all need to be familiar with the RFC terminology and use the same terms, as appropriate, for part 2. below.

2. Data requests - When a client requests data from a hapi-cache proxy server or library, the response can be built up based on existing data in the cache. This is a non-trivial problem (see the sketch below). I had a CS student implement something like this in the Python HAPI client. See the documentation at https://github.com/hapi-server/client-python/blob/master/hapiclient/hapi.py#L204.
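One core subproblem in 2. is deciding which subintervals of a request are not yet cached. A sketch of that interval arithmetic (the helper name and interval representation are assumptions):

```python
def missing_spans(start, stop, cached):
    """cached: sorted, non-overlapping (t0, t1) pairs already in the cache.
    Returns the subintervals of [start, stop) that must still be fetched."""
    gaps, t = [], start
    for c0, c1 in cached:
        if c0 > t:
            gaps.append((t, min(c0, stop)))  # uncovered span before this cached block
        t = max(t, c1)
        if t >= stop:
            break
    if t < stop:
        gaps.append((t, stop))  # trailing uncovered span
    return gaps
```

The response is then assembled by interleaving cached blocks with freshly fetched gap data in time order.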
We should also consider the motivation for 2. (and why I proposed the project):

a. Because some HAPI servers are slow, Eelco wants to proxy requests through a hapi-cache proxy server (could be local or remote).

b. I want something like the Python client features in the MATLAB client. If I have someone implement the Python features in MATLAB, the features in the two client libraries will eventually diverge. We could have the MATLAB client wrap the Python client, but it may be simpler to have the MATLAB client pull down a jar file that starts a caching proxy server and then route all requests through the proxy.

c. Somebody has to maintain the Python cache code and add features. I prefer to rip it out, include the (hopefully small) jar file for the proxy server in the package, and leverage new features developed by someone else.

d. Jeremy was already implementing many of the features in 2. in Autoplot. This code should be available to a wider audience and independent of Autoplot.
The motivation for the database schema is cache sharing among clients. This is tricky because of locking. It is not critical to publish a schema, but I expect we will eventually be glad we did.
- What is the remote source of the data in my cache
- What data (parameters) is in my cache
- What are the time specifics of the data in my cache
  - How old is the data in my cache
  - What are the time spans (or gaps) in my cache
- What is the serialization format of my cache (bin, csv, json, …)
- How much disk space does my cache take up
  - Arranged by parameters
  - Arranged by time
- What is the historical log of the data (where it was extracted and when)

Allows flushing the cache:
- Flush by some time range (remove stale data, too-recent data)
- Flush by parameters (possibly by some corresponding constraints?)
- Flush by remote source

Allows extraction of data from the local cache or the remote source:
- Utilize the local cache for times that fall within some time range
- Utilize the remote source for times that fall within some time range
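A sketch of an index database that could answer the questions above without a recursive directory listing (SQLite via Python; the table and column names are assumptions, not an agreed schema):

```python
import os
import sqlite3

os.makedirs("hapi_data", exist_ok=True)
con = sqlite3.connect("hapi_data/index.db")  # assumed location of the index
con.executescript("""
CREATE TABLE IF NOT EXISTS cache_entry (
    server    TEXT NOT NULL,  -- remote source of the data
    dataset   TEXT NOT NULL,
    parameter TEXT NOT NULL,  -- what parameters are in the cache
    start     TEXT NOT NULL,  -- time span covered (ISO 8601), for gap queries
    stop      TEXT NOT NULL,
    format    TEXT NOT NULL,  -- serialization format (bin, csv, json, ...)
    bytes     INTEGER,        -- disk usage, by parameter or by time
    written   TEXT NOT NULL,  -- when the data was extracted (age, history, flushing)
    path      TEXT NOT NULL   -- file location in the directory layout above
);
""")
```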
Audience (of users):

- Researchers that want to package or publish their cache (prepare for DOI?)
- Scientists that want consistent data for studies with adjusted parameters
- Scientists that want the latest data for studies using consistent parameters
- Software developers that will utilize this as a component in their software package

→ Note: these "users" will lead to high-level "use case" options
- Limited network - no pulling of data (performance)
High level "use case" options:

These high-level options should be strictly composed of lower-level options. They allow for common use cases without the need to specify, or be knowledgeable of, the lower-level options. Note that a high-level use-case option essentially provides defaults, but those defaults can be overridden by specifying lower-level options (after the use-case option). The pattern is: `--use-case-`
`--use-case-latest`:
This tells cache-tools to always attempt to get the latest data from the remote source and, only if it is not available, then resort to the local cache.
This tells cache-tools to only attempt to get the latest data from the remote source

`--use-case-timeout <timeout>`:
This tells cache-tools to always resort to the local cache, as long as the data is newer than the timeout value. Otherwise, if older (or not available), go pull from the remote source.

`--use-case-offline`:
This tells cache-tools to always resort to the local cache~~, and if there is no data there, then pull from the remote source~~.
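A sketch of the "use case sets defaults, lower-level options override" rule; the specific defaults shown are assumptions to illustrate the composition, not agreed semantics:

```python
# Defaults implied by each use case (assumed values for illustration only).
USE_CASE_DEFAULTS = {
    "latest":  {"use_cache": False, "use_expired_if_error": True},
    "timeout": {"use_cache": True,  "expire_after": None},  # filled from the flag's value
    "offline": {"use_cache": True,  "write_cache": False},
}

def resolve_options(use_case, cli_overrides):
    opts = dict(USE_CASE_DEFAULTS.get(use_case, {}))
    opts.update(cli_overrides)  # options given after the use-case option win
    return opts
```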
Options to specify the HAPI endpoint:

These options allow the user to specify the remote HAPI data endpoint or a HAPI server specification. There are two (mutually exclusive) mechanisms for specifying the HAPI address:

[1] You give it the fully qualified `--url` argument

or

[2] You give it the `--server` argument along with a combination of `--dataset`, `--format`, `--parameters`, `--start`, `--stop`
`--url <url>`:
URL to the HAPI data endpoint. This argument is not compatible with args: {`--server`, `--dataset`, `--format`, `--parameters`, `--start`, `--stop`}

`--server <url>`:
URL to the HAPI server.

`--dataset <dataset>`:
Target dataset from the HAPI server.

`--format=<format>`:
HAPI stream serialization format. Values: BIN, CSV, JSON

`--parameters <parameter> …`:
Target parameters from the HAPI dataset.

`--start <time>`:
The start time (of the HAPI data).

`--stop <time>`:
The stop time (of the HAPI data).
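A sketch of how the [2] form might reduce to the [1] form (the helper is an assumption; the comma-join of parameters matches the HAPI query convention):

```python
from urllib.parse import urlencode

def build_data_url(server, dataset, parameters, start, stop, fmt):
    """E.g., build_data_url("https://server/hapi", "DS", ["p1", "p2"],
    "1999-12-01T00:00:00Z", "1999-12-02T00:00:00Z", "csv")"""
    query = urlencode({
        "dataset": dataset,
        "parameters": ",".join(parameters),  # HAPI uses a comma-separated list
        "start": start,
        "stop": stop,
        "format": fmt,
    })
    return f"{server}/data?{query}"
```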
Options to control HAPI modification mode:
When you use cache-tools, what is the default modification mode? Should it apply changes by default or should it be explicit?
Some arguments:
`--dry-run` vs `--execute`

`--dry-run`:
This tells cache-tools not to alter the cache, but rather to do a trial run showing the actions it would take.

`--execute`:
By default, cache-tools will perform a "dry run" and thus will not modify the cache. This option causes cache-tools to perform any modifications rather than just preview them.
Options to configure HAPI logging:
These options control what is logged to stdout / stderr.
`--help, -h`:
Display this help message

`--verbose, -v`:
Controls the verbosity level. Specify multiple times to increase verbosity.

`--warn-remote-fail <all, off, error-codes, ...>`:
`--warn-remote-unavailable`:
`--warn-remote-timeout`:
Warns the user if there was a failure (unavailable, timeout, access denied) to access the remote source. By default this option is set to 'all'.

`--warn-cache-stale <all, off, ...>`:
Warns the user if the local cache is stale and will be updated. By default this option is set to 'off'.