Shard codec implementation planning

Current work in progress

TODOs

short term

medium term

  • ensure we like the n5 architecture
    • how it relates to n5-imglib2
    • how it relates to writing scalable parallel code in general
    • how it relates to n5-spark in particular
  • what is a good way to handle the multi-codec implementation
  • decide whether we're committing to sharding-as-a-codec
    • if yes, we can handle things generically (e.g. nested shards), but it's ugly
    • if not, it's nicer (e.g. shard info at the "top level"), but nested shards will be hard, or we just don't support them
  • remove lockForReading and lockForWriting from KVAs

Decisions to make

  • Does N5 the file format support sharding + multiple codecs?
    • No
    • but then the current architecture that ties the API to the n5 format is kind of weird.
      • do we change this now?
  • What are default behaviors / "scheduling policies" with respect to writing shards
  • Do we try to support nested sharding codecs, which zarr3 considers valid?
    • pro: would be nice to support anything that is valid for zarr3
    • con: certainly more work
    • consider whether we like the reworked API that would enable this more or less than the current state
  • How do we support the TransposeCodec
    • at the lowest level (n5-core)?
    • virtually using imglib2?
  • What is the behavior of the existing methods readBlock and writeBlock for sharded datasets? (one possible policy is sketched after this list)
  • What is the behavior of the new methods readShard and writeShard for un-sharded datasets?
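
One possible answer to the readBlock question above: delegate through the containing shard. A sketch, where getShardPositionForBlock and Shard#getBlock are assumed names rather than settled API:

// Hypothetical sketch of one readBlock policy for sharded datasets.
// getShardPositionForBlock and Shard#getBlock are assumptions, not settled API.
public DataBlock<?> readBlock(
        final String pathName,
        final DatasetAttributes attributes,
        final long... blockGridPosition) {

    // map the block's grid position to the grid position of its containing shard
    final long[] shardPosition = getShardPositionForBlock(attributes, blockGridPosition);

    // a shard reader can consult the shard index and fetch only this block's byte range
    final Shard<?> shard = readShard(pathName, attributes, shardPosition);
    return shard.getBlock(blockGridPosition);
}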

ReadData ideas

  • lazy splittable ReadData
    • splittable does not immediately call materialize, but only when a read operation occurs
  • operations on a ReadData can sometimes change its behavior (invalidate data)
  • what happens if we get a ReadData from a kva for a key that does not exist

example

InputStream inputStream = new ByteArrayInputStream(new byte[] {1, 2, 3}); // any InputStream source
ReadData a = ReadData.from(inputStream);
byte[] data = a.allBytes(); // consumes the underlying stream
a.inputStream();            // invalidated: the bytes were already drained by allBytes()

notes

2025 May 22

When adding blocks to an existing shard, it would be interesting to avoid decompressing blocks on read, just to re-compress them again on write. This would involve reading them into "raw" ReadData structures, then writing the bytes out directly (not through compressors).
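
A minimal sketch of that idea, assuming hypothetical getRawBlocks / addRawBlock accessors on the shard types discussed later in these notes:

// Hypothetical sketch: add one block to an existing shard without a
// decompress/recompress round trip. getRawBlocks, addRawBlock, and
// RawDataBlock are assumed names, not current API.
static <T> void appendBlock(
        final N5Writer writer,
        final String dataset,
        final DatasetAttributes attributes,
        final Shard<T> existingShard,
        final DataBlock<T> newBlock) {

    final InMemoryShard<T> merged = new InMemoryShard<>(attributes, existingShard.getGridPosition());
    for (final RawDataBlock raw : existingShard.getRawBlocks())
        merged.addRawBlock(raw); // copy still-compressed bytes as-is
    merged.addBlock(newBlock);   // only the new block gets encoded
    writer.writeShard(dataset, attributes, merged);
}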

Consider making every ReadData splittable, with a default implementation that materializes. Infinite streams may be tricky; those would have to keep a partially materialized ReadData. Tobi agrees that we should avoid materializing unless necessary. We do need to be careful for ReadData over InputStreams.
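
A sketch of what a materializing default could look like; split, materialize, and slice are assumed names (and note the June 12 decision below to remove split):

// Hypothetical sketch: every ReadData is splittable, with a default that
// materializes first. split, materialize, and slice are assumed names.
public interface ReadData {

    byte[] allBytes() throws N5IOException;

    ReadData materialize() throws N5IOException;

    ReadData slice(long offset, long length);

    // stream-backed implementations should override this to stay lazy;
    // the default loads everything and returns a view of the requested range
    default ReadData split(final long offset, final long length) throws N5IOException {
        return materialize().slice(offset, length);
    }
}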

DataBlockCodec and ArrayCodec do the same thing. Right now, we think we'll keep ArrayCodec. Or use ArrayCodec to refer to the factory, and the DataBlockCodecs for the specific types(?). (John is not totally sure what this refers to, but Caleb and Tobi know).

What should ReadData do when trying to read with a KVA for a key that does not exist?
Caleb: Would be nice for them to throw the runtime N5NoSuchKeyException.
Tobi: add partial-read methods to the KVA; then FileSplittableReadData wraps a KVA, and the KVA throws if no such key exists. Tobi is happy with ReadData methods throwing N5IOException.
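
Sketched, with createReadData as an assumed KVA method name:

// Hypothetical sketch of the missing-key behavior discussed above: the KVA
// throws a runtime N5NoSuchKeyException, and callers that treat "absent" as
// a normal case catch it. createReadData is an assumed method name.
byte[] readIfPresent(final KeyValueAccess kva, final String absolutePath) {
    try {
        return kva.createReadData(absolutePath).allBytes();
    } catch (final N5Exception.N5NoSuchKeyException e) {
        return null; // e.g. a block that was never written
    }
}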

2025 May 23

with Stephan.

"unsharded" datasets will report a shard size of 1 block.

n5-imglib2 and n5-"core" should share whatever utility method determines the shard to which a set of blocks belongs. (Note: these methods are currently in ShardParameters)
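
The core of that utility is integer division of block grid coordinates by the blocks-per-shard counts; a sketch with assumed names:

// Sketch of the shared utility: map a block's grid position to the grid
// position of the shard containing it. Names are assumptions; compare the
// methods currently living in ShardParameters.
static long[] getShardPositionForBlock(
        final int[] blocksPerShard,
        final long... blockGridPosition) {

    final long[] shardGridPosition = new long[blockGridPosition.length];
    for (int i = 0; i < blockGridPosition.length; i++)
        shardGridPosition[i] = blockGridPosition[i] / blocksPerShard[i];
    return shardGridPosition;
}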

2025 May 27

  • consider renaming "BytesCodec"
  • should one be allowed to get a Shard for a non-sharded dataset?
    • if we do, we should have a Shard type that holds one Block and nothing else (see the sketch after this list)
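
Such a type could be a trivial Shard implementation; a sketch, all names assumed:

// Hypothetical sketch of a Shard that holds exactly one block, for serving
// readShard on non-sharded datasets. All names here are assumptions.
class SingleBlockShard<T> implements Shard<T> {

    private final DataBlock<T> block;

    SingleBlockShard(final DataBlock<T> block) {
        this.block = block;
    }

    @Override
    public DataBlock<T> getBlock(final long... blockGridPosition) {
        return block; // the only block this "shard" can ever contain
    }
}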

2025 June 02

Many of the above were motivated while updating n5-zarr.

Aside: N5DataBlockCodec is technically not a DeterministicSizeCodec (to compute the encoded size, one also has to know the data type)

For future:

  • document and go over serialization annotations.

2025 June 12

  • Decisions we've made

    • We will remove split
  • Deciding on the boundary of IOException vs N5Exception

    • allow lambdas to throw N5Exception
    • KVA methods (lockForReading, lockForWriting, list, etc.) should throw N5Exception
  • which of the current open PRs can we merge now?

    • given that changes are always possible (even big ones)
    • Do the KVA exception stuff in the exception PR tomorrow, then merge it.
      • and any necessary downstream changes (in compressions and backends)
  • ReadData behavior (currently outlined in tests; a slice sketch follows this list)

    • when are out of bounds exceptions thrown
    • -1 as a length argument means "to the end"
    • zero length?
  • sharding will be the next PR, will include:

    • ShardingCodec
    • DatasetAttribute additions
    • Shard interface and implementations (VirtualShard, InMemoryShard)
    • new API methods N5Reader#readShard, N5Writer#writeShard
  • Later, consider something in-between materialize().size() and size(), which does not read data, but queries the backend to find the size (if possible)
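
For the ReadData behavior items above, a sketch of the slice semantics (slice, size, and view are assumed names):

// Hypothetical sketch of the behavior outlined in the tests: a negative
// length means "to the end", zero length is a legal empty range, and
// offsets outside [0, size] throw.
ReadData slice(final long offset, final long length) {

    final long size = size();
    if (offset < 0 || offset > size)
        throw new IndexOutOfBoundsException("offset " + offset + " not in [0, " + size + "]");

    final long resolvedLength = (length < 0) ? size - offset : length; // -1: to the end
    if (offset + resolvedLength > size)
        throw new IndexOutOfBoundsException("range exceeds size " + size);

    return view(offset, resolvedLength); // zero length yields an empty ReadData
}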

Ideas looking ahead to convenience for parallel writing:

  • something called LazyDataBlock or similar that takes a Supplier<(datatype)> or something that generates the block data on the fly when needed (see the sketch after this list)
    • also potentially useful if we read a whole shard but don't want to decode every block, the supplier in this case could be the decoding operation over the correct part of the whole shard's ReadData
  • Methods that iterate over shards / blocks in a sensible way
  • Methods that take the function above and write in parallel in a sensible way
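
A sketch of the LazyDataBlock idea from the first item; the class shape is an assumption:

import java.util.function.Supplier;

// Hypothetical sketch of LazyDataBlock: the payload is produced on demand,
// at most once. The class and constructor shape are assumptions.
class LazyDataBlock<T> {

    private final long[] gridPosition;
    private final Supplier<T> dataSupplier;
    private T data;

    LazyDataBlock(final long[] gridPosition, final Supplier<T> dataSupplier) {
        this.gridPosition = gridPosition;
        this.dataSupplier = dataSupplier;
    }

    synchronized T getData() {
        if (data == null)
            data = dataSupplier.get(); // e.g. decode this block's slice of the shard's ReadData
        return data;
    }

    long[] getGridPosition() {
        return gridPosition;
    }
}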

Doc:

  • writeShard overwrites
  • writeBlock overwrites
  • writeBlocks merges into the shard (if sharded)

A Shard perhaps should not contain DataBlocks but "StagedDataBlocks", where "staged" means read but not yet decoded. Different "flavors" of shards would be useful in general.
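
A sketch of that idea; all names assumed:

// Hypothetical sketch of a "staged" block: bytes have been read from
// storage but not decoded. All names here are assumptions.
interface StagedDataBlock<T> {

    long[] getGridPosition();

    ReadData getReadData();  // raw, still-encoded bytes

    DataBlock<T> decode();   // decode only when the payload is actually needed
}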
