Shard codec implementation planning
- Finalize ReadData PRs https://github.com/saalfeldlab/n5/pull/137 (outdated) and https://github.com/saalfeldlab/n5/pull/139 (outdated)
- refactor methods in `ShardParameters`
- Add in `SplittableReadData` https://github.com/saalfeldlab/n5/pull/140, in progress on branch `wip/codecsReadData`
  - compare to https://github.com/tpietzsch/n5/pull/1
  - consider `slice`, `limit`, `split`, discussion here
  - consider `head`, `tail` methods
- low-level API details (e.g. naming: `DataBlockCodec` vs `ArrayCodec`, `Compressor` to `ByteCodec[]`s)
- migrate things accepting just a `Compressor` to things accepting a `Codec[]`
  - except for N5-specific stuff, if N5 does not support this feature
- a shard is a block that returns blocks. This could be nice, but we don't have it now. Do we go down this route? (see the interface sketch after this list)
- fix locked file channel (see #141)
- TODO for John: do we have tests for `ConcatenatedBytesCodecs`? Write some if not. done
- update n5-zarr
  - in progress in `wip/codecsShards` and `wip/codecsReadData-refactor`
- ensure we like the n5 architecture
  - how it relates to n5-imglib2
  - how it relates to writing scalable parallel code in general
  - how it relates to n5-spark in particular
- what is a good way to handle the multi-codec implementation
- decide whether we're committing to sharding-as-a-codec
  - if yes, we can handle things generically (e.g. nested shards), but it's ugly
  - if not, it's nicer (e.g. shard info at "top level") but nested shards will be hard, or we just don't do it
- remove lockForReading and lockForWriting from KVAs
- Does N5 (the file format) support sharding + multiple codecs?
  - No
  - but then the current architecture that ties the API to the N5 format is kind of weird. Do we change this now?
- What are the default behaviors / "scheduling policies" with respect to writing shards?
- Do we try to support nested sharding codecs (which is valid)?
  - pro: would be nice to support anything that is valid for zarr3
  - con: certainly more work
  - could consider whether we like the re-working of the API that would enable this more or less than the current state
- How do we support the `TransposeCodec`?
  - at the lowest level (n5-core)?
  - virtually, using imglib2?
- What is the behavior of the existing methods `readBlock`, `writeBlock` for sharded datasets?
- What is the behavior of the new methods `readShard`, `writeShard` for un-sharded datasets?
- lazy splittable ReadData
  - `splittable` does not immediately call `materialize`, but only when a read operation occurs
- operations on a ReadData can sometimes change its behavior (invalidate data), for example:

  ```java
  InputStream inputStream = ...;
  ReadData a = ReadData.from(inputStream);
  byte[] data = a.allBytes();  // reads (and consumes) the underlying stream
  a.inputStream();             // what should this return now?
  ```

- what happens if we get a ReadData from a KVA for a key that does not exist?
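As a purely hypothetical reading of the "a shard is a block that returns blocks" item above: a `Shard` type that is addressed like a block but exposes the blocks it contains. Only `DataBlock` exists in n5 today; the interface and method names below are illustration only, and the planned Shard interface (VirtualShard, InMemoryShard) may look quite different.

```java
import org.janelia.saalfeldlab.n5.DataBlock;

// Hypothetical sketch only, not current or planned n5 API.
public interface Shard<T> {

	// position of this shard in the shard grid, analogous to DataBlock#getGridPosition
	long[] getGridPosition();

	// the block at the given position within this shard (decoding it if necessary)
	DataBlock<T> getBlock(long... blockGridPosition);

	// all blocks currently stored in this shard
	Iterable<DataBlock<T>> getBlocks();
}
```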
When adding blocks to an existing shard, it would be interesting to be able to avoid decompressing blocks when reading, just to re-compress them again when writing. This would involve reading them into "raw" ReadData structures, then directly writing the bytes out again (just not through compressors).
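A minimal sketch of that idea, under stated assumptions: it presumes the in-progress ReadData/SplittableReadData ends up with a `slice(offset, length)` operation and a way to copy its bytes to an `OutputStream` (`writeTo` below). Those method names and the `BlockEntry` index type are placeholders for illustration, not the actual API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

class RawShardCopy {

	// hypothetical shard-index entry: where an encoded block starts and how many bytes it occupies
	static class BlockEntry {

		final long offset;
		final long length;

		BlockEntry(final long offset, final long length) {
			this.offset = offset;
			this.length = length;
		}
	}

	// copy the existing (still compressed) blocks of a shard without decoding them:
	// slice the raw bytes of each block and write them out again, bypassing all codecs
	static void copyExistingBlocks(final ReadData shardBytes, final List<BlockEntry> index, final OutputStream out)
			throws IOException {

		for (final BlockEntry e : index)
			shardBytes.slice(e.offset, e.length).writeTo(out);
	}
}
```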
Consider making every ReadData splittable, with a default implementation that materializes. Infinite streams may be tricky, but we would have to keep partially materialized ReadData. Tobi agrees that we should avoid materializing unless necessary. We do need to be careful with ReadData over InputStreams.
DataBlockCodec and ArrayCodec do the same thing. Right now, we think we'll keep ArrayCodec. Or use ArrayCodec to refer to the factory, and the DataBlockCodecs for the specific types(?). (John is not totally sure what this refers to, but Caleb and Tobi know).
What should ReadData do when trying to read with a KVA for a key that does not exist?
Caleb: Would be nice for them to throw the runtime N5NoSuchKeyException.
Tobi: add partial read methods to the KVA; FileSplittableReadData wraps a KVA, and the KVA throws if no such key exists. Tobi is happy with ReadData methods throwing N5IOException.
with Stephan.
"unsharded" datasets will report a shard size of 1 block.
n5-imglib2 and n5-"core" should share whatever utility method determines the shard to which a set of blocks belongs. (Note: these methods are currently in `ShardParameters`.)
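A minimal sketch of the kind of shared utility meant here (not the actual `ShardParameters` code): the shard containing a block is found by integer (floor) division of the block grid position by the blocks-per-shard.

```java
// Sketch only, not the ShardParameters implementation.
static long[] shardPositionOf(final long[] blockGridPosition, final int[] blocksPerShard) {

	final long[] shardGridPosition = new long[blockGridPosition.length];
	for (int d = 0; d < blockGridPosition.length; d++)
		shardGridPosition[d] = Math.floorDiv(blockGridPosition[d], (long)blocksPerShard[d]);
	return shardGridPosition;
}
```

With `blocksPerShard = {1, 1, ...}` every shard holds a single block and the shard position equals the block position, which matches the "shard size of 1 block" convention for unsharded datasets above.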
- consider renaming "BytesCodec"
- should one be allowed to get a Shard for a non-sharded dataset?
  - if we do, we should have a Shard type that holds one Block and nothing else
- Attribute path normalization refactor
- Exception handling changes
- Invalid block behavior test and issue
- DataBlock decoding problem
- significant `DatasetAttributes` constructor changes (see the sketch after this list)
  - related, make ConcatenatedByteCodec public, but this would be good to change
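For context on the `DatasetAttributes` constructor change: the first call below is the long-standing constructor; the second is only a guess at the codec-based shape being discussed, so its argument types and order are assumptions, not the merged API.

```java
// Long-standing n5 constructor: a single Compression.
final DatasetAttributes current = new DatasetAttributes(
		new long[]{64, 64, 64},   // dataset dimensions
		new int[]{16, 16, 16},    // block size
		DataType.UINT8,
		new GzipCompression());

// Guess at the codec-based shape (NOT the merged API): a chain of codecs instead of
// a single Compression, so sharding/transpose/byte codecs can be composed.
final DatasetAttributes withCodecs = new DatasetAttributes(
		new long[]{64, 64, 64},
		new int[]{16, 16, 16},
		DataType.UINT8,
		new Codec[]{new GzipCompression()});
```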
Many of the above were motivated by updating n5-zarr here.
Aside: N5DataBlockCodec is technically not a DeterministicSizeCodec (one also has to know the data type)
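To make the aside concrete, a small sketch (not n5 API) of why: for the default (mode 0) N5 block format, the encoded size is a fixed header plus `numElements * bytesPerElement`, and the bytes-per-element comes from the data type, not from the codec itself.

```java
// Illustration only: the encoded size of a default-mode N5 block can only be computed
// if the data type's bytes-per-element is known, so the codec alone cannot be a
// DeterministicSizeCodec.
static long encodedBlockSize(final int[] blockSize, final int bytesPerElement) {

	long numElements = 1;
	for (final int s : blockSize)
		numElements *= s;

	final long headerSize = 2 + 2 + 4L * blockSize.length; // mode (uint16), ndim (uint16), dims (uint32 each)
	return headerSize + numElements * bytesPerElement;      // payload size depends on the data type
}
```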
For future:
- document and go over serialization annotations.
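For reference, the existing annotations meant here are `Compression.CompressionType` and `Compression.CompressionParameter`, as used by `GzipCompression`. The example below is schematic: the class name and field are made up, and a real implementation would also implement the `Compression` interface and its read/write logic, omitted here.

```java
import org.janelia.saalfeldlab.n5.Compression.CompressionParameter;
import org.janelia.saalfeldlab.n5.Compression.CompressionType;

// Schematic only: just shows how the serialization annotations are applied.
@CompressionType("example")
public class ExampleCompressionParameters {

	@CompressionParameter
	private final int level; // serialized by name into the dataset attributes

	public ExampleCompressionParameters(final int level) {
		this.level = level;
	}
}
```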
Decisions we've made
- We will remove `split`
- Deciding on the boundary of `IOException` vs `N5Exception`
  - allow lambdas to throw `N5Exception`
  - KVA methods should throw `N5Exception` (lockForReading, lockForWriting, list, etc.)
- which of the current open PRs can we merge now?
  - given that changes are always possible (even big ones)
  - Do the KVA exception stuff in the exception PR tomorrow, then merge it.
    - and any necessary downstream changes (in compressions and backends)
- ReadData behavior (currently outlined in tests)
  - when are out-of-bounds exceptions thrown?
  - `-1` as a length argument means "to the end"
  - zero length?
- sharding will be the next PR, which will include:
  - ShardingCodec
  - DatasetAttributes additions
  - Shard interface and implementations (VirtualShard, InMemoryShard)
  - new API methods N5Reader#readShard, N5Writer#writeShard
- Later, consider something in between `materialize().size()` and `size()`, which does not read data, but queries the backend to find the size (if possible)
Ideas looking ahead to convenience for parallel writing:
- something called `LazyDataBlock` or similar that takes a `Supplier<(datatype)>` or something that generates the block data on the fly when needed (see the sketch after this list)
  - also potentially useful if we read a whole shard but don't want to decode every block; the supplier in this case could be the decoding operation over the correct part of the whole shard's ReadData
- Methods that iterate over shards / blocks in a sensible way
- Methods that take the function above and write in parallel in a sensible way
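A sketch of the `LazyDataBlock` idea above; the class, its constructor, and its relation to `DataBlock` are assumptions for illustration (only `Supplier` is standard Java). The supplier could be a user function when writing in parallel, or the decode of the right slice of a shard's ReadData when reading.

```java
import java.util.function.Supplier;

// Hypothetical sketch of a block whose data is produced only when first requested.
class LazyDataBlock<T> {

	private final long[] gridPosition;
	private final Supplier<T> dataSupplier;
	private T data;

	LazyDataBlock(final long[] gridPosition, final Supplier<T> dataSupplier) {
		this.gridPosition = gridPosition;
		this.dataSupplier = dataSupplier;
	}

	long[] getGridPosition() {
		return gridPosition;
	}

	T getData() {
		if (data == null)
			data = dataSupplier.get(); // generate or decode the data on first access
		return data;
	}
}
```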
Doc:
- writeShard overwrites
- writeBlock overwrites
- writeBlocks merges shard (if sharded)
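The same semantics written as calls against the planned (not yet merged) API; every signature below is an assumption based on these notes, except `writeBlock`, which exists today.

```java
// Assumed/planned API, sketched from the notes above (writeBlock already exists):
n5.writeShard(dataset, attributes, shard);    // overwrites the whole shard
n5.writeBlock(dataset, attributes, block);    // overwrites that block (its shard is rewritten)
n5.writeBlocks(dataset, attributes, blocks);  // merges the given blocks into existing shards, if sharded
```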
A Shard perhaps should not contain DataBlocks, but "StagedDataBlocks" where staged means "read" but not decoded. Different "flavors" of shards would be useful in general.