Performance - uhop/stream-json GitHub Wiki
This toolkit processes huge files. Even a microsecond per operation adds up: over a billion operations, that is ~16.7 minutes. A millisecond per operation adds up to ~11.6 days.
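A quick back-of-the-envelope check of those numbers:

```javascript
// Per-operation overhead multiplied across a billion operations.
const ops = 1e9;
const microsecond = 1e-6; // seconds
const millisecond = 1e-3; // seconds

const minutes = (ops * microsecond) / 60;  // ≈ 16.7 minutes
const days = (ops * millisecond) / 86400;  // ≈ 11.6 days
console.log(minutes.toFixed(1), days.toFixed(1)); // prints "16.7 11.6"
```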
For quantitative comparisons between components, see Benchmarks.
Performance played a major role in the design of stream-json. Below are best practices for getting the most out of it.
Every link in a pipeline introduces latency. Combine small filters and transforms into one component where possible (examples use stream-chain):
const {chain} = require('stream-chain');
// fine-grained, but less efficient
chain([
sourceStream,
// filters
data => (data.key % 2 !== 0 ? data : null),
data => (data.value.important ? data : null),
// transforms
data => data.value.price,
price => price * taxRate
]);
// more efficient
chain([
sourceStream,
data => {
if (data.key % 2 !== 0 && data.value.important) {
return data.value.price * taxRate;
}
return null; // ignore
}
]);
Stream boundaries are relatively expensive. Use them when components produce a varying number of items — this takes advantage of built-in backpressure handling. Otherwise, plain function calls are more efficient.
Less traffic means faster pipelines. Arrange filters to remove items as early as possible:
// let's assume that we have a small number of important objects,
// and valid() is an expensive function to calculate
// fine-grained, but less efficient
chain([
sourceStream,
// filters
data => (valid(data) ? data : null),
data => (data.value.important ? data : null)
]);
// better
chain([
sourceStream,
// filters
data => (data.value.important ? data : null),
data => (valid(data) ? data : null)
]);
// best
chain([
sourceStream,
// filters
data => (data.value.important && valid(data) ? data : null)
]);
Put cheap, high-rejection filters first. The same goes for transforms.
stream-json 2.x uses stream-chain under the hood.
It provides a way to create pipelines of streams. While chain() can consume different types of
components, in order of efficiency they are:
- Functions (the most efficient)
- Asynchronous functions
- Generator functions
- Async generator functions
- Node streams
- Web streams (least efficient)
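For a concrete feel of the first three tiers, here is the same trivial transform expressed at each of them (the names are illustrative, not part of any API):

```javascript
// Plain function: one call, no extra machinery. The fastest option.
const double = data => data * 2;

// Async function: every invocation allocates a promise that the
// pipeline has to await, even when the work itself is synchronous.
const doubleAsync = async data => data * 2;

// Generator function: can emit zero or more items per input,
// at the cost of iterator-protocol dispatch on every call.
function* doubleGen(data) {
  yield data * 2;
}
```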
This ordering is current as of Node 25; it may differ on other runtimes or future Node versions. If speed is critical, benchmark the alternatives.
Parser streams values by default, but sometimes only the final value is needed. It can also pack values and emit xxxValue tokens directly, so there is no need to replicate this in custom code unless doing so offers measurable benefits.
When packed values are sufficient, suppress streaming chunks to reduce traffic. Use streamXXX options (effective only when packing the corresponding values):
const p1 = parser();
// streams values like that:
// {name: 'startString'}
// {name: 'stringChunk', value: 'a'} // zero or more chunks
// {name: 'stringChunk', value: 'b'}
// {name: 'endString'}
// {name: 'stringValue', value: 'ab'}
// In reality, chunks are unlikely to be just one character long.
const p2 = parser({packValues: false});
// streams values like that:
// {name: 'startString'}
// {name: 'stringChunk', value: 'a'} // zero or more chunks
// {name: 'stringChunk', value: 'b'}
// {name: 'endString'}
const p3 = parser({streamValues: false});
// streams values like that:
// {name: 'stringValue', value: 'ab'}
Downstream components may require specific token types. For example, filters require packed key values. Replace has additional requirements described below. Stringer defaults to value chunks but can use packed values instead (see below).
The main module creates a parser decorated with emit(). If token events are not needed, use parser() directly:
const makeParser = require('stream-json');
makeParser().pipe(someFilter); // token events are not used
// better: use parser() directly without the emit() decoration
const {parser} = require('stream-json');
parser.asStream().pipe(someFilter);
(Since 1.6.0) If you deal with a strict JSONL (or NDJSON) format, and convert token streams to JavaScript objects using streamers, use a dedicated JSONL parser to improve performance.
For simple cases, use stream-chain/jsonl/parser directly. Use stream-json/jsonl/parser only when you need errorIndicator or checkErrors.
const makeParser = require('stream-json');
const {streamValues} = require('stream-json/streamers/stream-values.js');
chain([makeParser({jsonStreaming: true}), streamValues(), someConsumer]);
// more efficient — stream-chain (recommended)
const jsonlParser = require('stream-chain/jsonl/parser.js');
chain([jsonlParser(), someConsumer]);
// stream-json — when you need errorIndicator/checkErrors
const sjJsonlParser = require('stream-json/jsonl/parser.js');
chain([sjJsonlParser({errorIndicator: null}), someConsumer]);
A common case is selecting a single item from a stream. After the match, no further items can match, yet the filter continues to process them. This is especially common with string filters doing a direct match. Set {once: true} to stop filtering after the first match.
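The effect of {once: true} can be pictured with a hand-rolled function filter (makeOnceFilter is a hypothetical helper for illustration, not part of stream-json):

```javascript
// Rejects everything after the first match without re-running the
// predicate — which is what {once: true} buys you on a real filter.
const makeOnceFilter = predicate => {
  let matched = false;
  return data => {
    if (matched) return null; // fast path after the match
    if (predicate(data)) {
      matched = true;
      return data;
    }
    return null;
  };
};

const firstEven = makeOnceFilter(x => x % 2 === 0);
```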
Replace can generate substreams by itself:
- A replacement substream provided by the user.
- Property keys generated for a replacement.
Ignore is based on Replace.
The replacement itself is fully controlled by the user, while generated keys can include a streaming part:
const r1 = replace();
// can generate keys like that:
// {name: 'startKey'}
// {name: 'stringChunk', value: 'a'}
// {name: 'endKey'}
// {name: 'keyValue', value: 'a'}
const r2 = replace({streamKeys: false});
// can generate keys like that:
// {name: 'keyValue', value: 'a'}
Usually the same value style should be used across the entire pipeline.
All streamers support objectFilter to discard objects during assembly rather than after. This can be more efficient, but consider:
- If the property needed for the decision usually comes last, the whole object is assembled before the decision can be made — no benefit.
- objectFilter is called on every update during assembly. If the function is expensive, it may be cheaper to filter after assembly.
// variant #1
chain([
sourceStream,
streamArray({
objectFilter: asm => {
const value = asm.current;
if (value && value.hasOwnProperty('important')) {
return value.important;
}
// return undefined; // we are undecided yet
}
})
]);
// variant #2
chain([
sourceStream,
streamArray(),
data => {
const value = data.value;
return value && value.important;
}
]);
Analyze or benchmark your data stream to decide the most efficient filtering strategy.
Utf8Stream (deprecated) is only useful when reading binary buffers. If the input is already string data, it is a no-op and can be removed from the pipeline.
withParser() returns a functional pipeline for use inside chain(). The withParserAsStream() variant wraps it in a Duplex stream for .pipe(). Using the functional form inside chain() avoids the extra stream boundary:
// wraps in a Duplex — adds a stream boundary
const pipeline = streamArray.withParserAsStream();
fs.createReadStream('sample.json').pipe(pipeline);
pipeline.on('end', () => console.log('done!'));
// more efficient: functional form inside chain()
const pipeline = chain([fs.createReadStream('sample.json'), streamArray.withParser()]);
pipeline.on('end', () => console.log('done!'));
Assembler provides consume(data), but you can dispatch directly without the function-call overhead:
data => asm[data.name] && asm[data.name](data.value);
Consider this when building custom components around Assembler.
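The dispatch pattern itself, shown with a minimal stand-in object (a real Assembler exposes methods named after the tokens it understands; the object below is illustrative only):

```javascript
// Token handlers keyed by token name; dispatch silently skips tokens
// the object has no handler for, with no generic wrapper in between.
const asm = {
  result: [],
  startArray() {},
  numberValue(value) { this.result.push(Number(value)); },
  endArray() {}
};

const tokens = [
  {name: 'startArray'},
  {name: 'numberValue', value: '1'},
  {name: 'numberValue', value: '2'},
  {name: 'endArray'}
];

for (const data of tokens) {
  asm[data.name] && asm[data.name](data.value);
}
```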
Stringer uses value chunks (stringChunk and numberChunk) to produce its output. If streaming was disabled upstream with {streamValues: false}, Stringer will break. Switch it to packed values with: useValues, useKeyValues, useStringValues, or useNumberValues. Always ensure the pipeline is consistent.
Both Emitter and emit() are convenience helpers. If they prove to be a bottleneck, they are easy to bypass:
// variant #1
const emitter = require('stream-json/emitter.js');
const e = emitter();
sourceStream.pipe(e);
e.on('startObject', () => console.log('object!'));
// variant #2
emit(sourceStream);
sourceStream.on('startObject', () => console.log('object!'));
// more efficient variant #3
sourceStream.on('data', data => data.name === 'startObject' && console.log('object!'));