Workflow - nsip/nias GitHub Wiki

Workflow

The NIAS workflow uses a chain of microservices: content arrives via a REST call and is passed through a series of Kafka topics. The chain of calls is summarised below.

Summary

  • HTTP POST xml to /:topic/:stream/bulk >> Kafka topic: sifxml.bulkingest, message: TOPIC topic.stream + xml-fragment1, xml-fragment2, xml-fragment3 ...

    • HTTP POST xml to /:topic/:stream >> Kafka topic: sifxml.ingest, message: TOPIC topic.stream + xml
  • Kafka topic: sifxml.bulkingest, message: TOPIC topic.stream + xml-fragment1, xml-fragment2, xml-fragment3 ... >> Kafka topic: sifxml.validated, message: TOPIC topic.stream 1:n:doc-id + xml-node1, TOPIC topic.stream 2:n:doc-id + xml-node2, TOPIC topic.stream 3:n:doc-id + xml-node3 ...

    • Kafka topic: sifxml.ingest, message: TOPIC topic.stream + xml ... >> Kafka topic: sifxml.validated, message: TOPIC topic.stream 1:n:doc-id + xml-node1, TOPIC topic.stream 2:n:doc-id + xml-node2, TOPIC topic.stream 3:n:doc-id + xml-node3 ...
  • Kafka topic: sifxml.validated, message m >> Kafka topic: sifxml.processed, message m

  • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: topic.stream.none, message: xml-node

    • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: topic.stream.low, message: redacted_extreme(xml-node)
    • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: topic.stream.medium, message: redacted_high(redacted_extreme(xml-node))
    • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: topic.stream.high, message: redacted_medium(redacted_high(redacted_extreme(xml-node)))
    • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: topic.stream.extreme, message: redacted_low(redacted_medium(redacted_high(redacted_extreme(xml-node))))
  • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Kafka topic: sms.indexer, message: tuple representing graph information of xml-node

  • Kafka topic: sms.indexer, message: tuple representing graph information of xml-node >> Redis: sets representing the contribution of xml-node to the graph of nodes

  • Kafka topic: sifxml.processed, message: TOPIC topic.stream m:n:doc-id + xml-node >> Moneta key: xml-node RefId, value: xml-node

  • HTTP POST staffcsv to /naplan/csv_staff

    • Kafka topic: naplan.csv_staff, message: staffcsv

    • Kafka topic: sifxml.ingest, message: TOPIC naplan.sifxmlout_staff + csv2sif(staffcsv)

    • Kafka topic: naplan.sifxmlout_staff.none, message: csv2sif(staffcsv)

    • Moneta key: naplan.csv_staff::LocalId(staffcsv), value: staffcsv

  • HTTP POST studentcsv to /naplan/csv >> Kafka topic: naplan.csv, message: studentcsv

    • Kafka topic: sifxml.ingest, message: TOPIC naplan.sifxmlout + csv2sif(studentcsv)

    • Kafka topic: sifxml.processed, message: TOPIC naplan.sifxmlout + csv2sif(studentcsv) + Platform Student Identifier

    • Kafka topic: naplan.sifxmlout.none, message: csv2sif(studentcsv) + Platform Student Identifier

    • Moneta key: naplan.csv::LocalId(studentcsv), value: studentcsv

    • Kafka topic: naplan.filereport, message: report on contents of file

  • HTTP POST staffxml to /naplan/sifxml_staff

    • Kafka topic: naplan.sifxml_staff.none, message: staffxml

    • Kafka topic: naplan.csvstaff_out, message: sif2csv(staffxml)

    • Moneta key: RefId(staffxml), value: staffxml

  • HTTP POST studentxml to /naplan/sifxml

    • Kafka topic: sifxml.processed, message: studentxml + Platform Student Identifier

    • Kafka topic: naplan.sifxml.none, message: studentxml + Platform Student Identifier

    • Kafka topic: naplan.csv_out, message: sif2csv(studentxml)

    • Moneta key: RefId(studentxml), value: studentxml

    • Kafka topic: naplan.filereport, message: report on contents of file
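The message headers used throughout the chain above follow a simple convention: ingest messages carry `TOPIC: topic.stream`, and each record on sifxml.validated carries `TOPIC: topic.stream m:n:doc-id`. A minimal Ruby sketch of that convention (the exact byte layout is an assumption; the authoritative format is in the NIAS source):

```ruby
require 'digest'

# Build a sifxml.ingest message: payload prefixed with "TOPIC: topic.stream".
def ingest_message(topic, stream, xml)
  "TOPIC: #{topic}.#{stream}\n#{xml}"
end

# Build one sifxml.validated message per child node, each prefixed with
# "TOPIC: topic.stream m:n:doc-id" — the m-th record of n, plus a hash
# identifying the payload the records came from.
def validated_messages(topic, stream, xml_nodes)
  doc_id = Digest::SHA1.hexdigest(xml_nodes.join)[0, 8]
  n = xml_nodes.size
  xml_nodes.each_with_index.map do |node, i|
    "TOPIC: #{topic}.#{stream} #{i + 1}:#{n}:#{doc_id}\n#{node}"
  end
end

msgs = validated_messages('naplan', 'sifxml', ['<StudentPersonal/>', '<StudentPersonal/>'])
# each message carries its ordinal, the total count, and the shared doc-id
```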

REST calls

  • ssf/ssf_server.rb

    • Input: HTTP POST xml to /:topic/:stream
      • Format: xml : XML, assumed to be SIF collection
    • Output: Kafka stream sifxml.ingest
      • Content: xml with header TOPIC: topic.stream
    • Constraints: XML less than 1 MB
  • ssf/ssf_server.rb

    • Input: HTTP POST json to /:topic/:stream
      • Format: json : JSON
    • Output1: Kafka stream topic.stream
      • Content: json
    • Output2: Kafka stream json.storage
      • Content: json with header TOPIC: topic.stream
    • Constraints: JSON less than 1 MB
  • ssf/ssf_server.rb

    • Input: HTTP POST csv to /:topic/:stream
      • Format: CSV
    • Output1: Kafka stream csv.errors
      • Content: csv errors
      • Constraints: CSV is malformed
    • Output2: Kafka stream csv.errors
      • Content: csv errors
      • Constraints: CSV violates CSV schema, for specific topics with defined schemas (naplan.csv_staff, naplan.csv)
    • Output3: Kafka stream topic.stream
      • Content: csv parsed into JSON
      • Constraints: CSV is valid
    • Output4: Kafka stream json.storage
      • Content: csv parsed into JSON with header TOPIC: topic.stream
      • Constraints: CSV is valid
    • Constraints: CSV less than 1 MB
  • ssf/ssf_server.rb

    • Input: HTTP POST xml to /:topic/:stream/bulk
      • Format: XML, assumed to be SIF collection
    • Output: Kafka stream sifxml.bulkingest
      • Content: xml with header TOPIC: topic.stream, split up into blocks of 950 KB, each but the last terminating in ===snip nn===
    • Constraints: XML less than 500 MB
  • ssf/ssf_server.rb

    • Input: HTTP GET from /:topic/:stream
    • Output: Stream of messages from Kafka stream topic.stream
      • Content: JSON objects with keys data, key, consumer_offset, hwm (= highwater mark), restart_from (= next offset)
  • ssf/ssf_server.rb

    • Input: HTTP GET from /csverrors
    • Output: Stream of messages from Kafka streams csv.errors, sifxml.errors, naplan.srm_errors
      • Content: Header m:n:i type, followed by the error message
        • type: type of error reported
        • m: ordinal number of message within the error type
        • n: total number of error messages within the error type
        • i: record id, if the errors are being reported per record rather than for an entire file
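The /:topic/:stream/bulk endpoint accepts payloads far larger than a single Kafka message, so it frames the XML into blocks terminated with `===snip nn===` markers. A sketch of that framing (the helper names and exact marker formatting are illustrative assumptions; only the 950 KB block size and the marker text come from the wiki):

```ruby
# Split a large XML payload into blocks of at most block_size characters,
# every block but the last terminated with "===snip nn===" so the
# bulk-ingest consumer can reassemble the original document in order.
def split_for_bulk_ingest(xml, block_size: 950 * 1024)
  blocks = xml.each_char.each_slice(block_size).map(&:join)
  blocks.each_with_index.map do |block, i|
    i < blocks.size - 1 ? "#{block}===snip #{format('%02d', i + 1)}===" : block
  end
end

# Strip the snip markers and concatenate, recovering the original payload.
def reassemble(blocks)
  blocks.map { |b| b.sub(/===snip \d+===\z/, '') }.join
end
```

A 500 MB document therefore travels as a few hundred Kafka messages that the validator joins back together before parsing.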

SSF microservices

  • cons-prod-sif-bulk-ingest-validate.rb

    • Input: Kafka stream sifxml.bulkingest
      • Format: sequence of XML fragments xml-fragment1, xml-fragment2, xml-fragment3, all but the last terminating in ===snip nn===; whole message has header TOPIC: topic.stream
    • Output 1: Kafka stream sifxml.errors
      • Content: All well-formedness or validation errors from parsing xml, the concatenation of xml-fragment1, xml-fragment2, xml-fragment3 ... (with ===snip nn=== stripped out)
      • Constraint: xml is malformed or invalid
    • Output 2: Kafka stream sifxml.validated
      • Content: xml-node1, xml-node2, xml-node3, ... : one message for each child node of xml, each message with header TOPIC: topic.stream m:n:doc-id (m-th record out of n, with a hash used to identify the document that the messages came from)
      • Constraint: xml is valid
  • cons-prod-sif-ingest-validate.rb

    • Input: Kafka stream sifxml.ingest
      • Format: SIF/XML collection xml
    • Output 1: Kafka stream sifxml.errors
      • Content: All well-formedness or validation errors from parsing xml. Add CSV original line if SIF document originates in CSV.
      • Constraint: xml is malformed or invalid
    • Output 2: Kafka stream sifxml.validated
      • Content: xml-node1, xml-node2, xml-node3, ... : one message for each child node of xml, each message with header TOPIC: topic.stream m:n:doc-id (m-th record out of n, with a hash used to identify the payload that the messages came from)
      • Constraint: xml is valid
  • cons-prod-sif-process.rb

    • Input: Kafka stream sifxml.validated
      • Format: SIF/XML object xml-node with header TOPIC: topic.stream m:n:doc-id
    • Output 1: Kafka stream sifxml.processed
      • Content: xml-node subject to any needed processing
  • cons-prod-privacyfilter.rb

    • Input: Kafka stream sifxml.processed
      • Format: SIF/XML object xml-node with header TOPIC: topic.stream m:n:doc-id
    • Output 1: Kafka stream topic.stream.none
      • Content: xml-node passed through with no redaction applied
    • Output 2: Kafka stream topic.stream.low
      • Content: xml-node with all content matching xpaths in privacyfilters/extreme.xpath redacted
    • Output 3: Kafka stream topic.stream.medium
      • Content: xml-node with all content matching xpaths in privacyfilters/extreme.xpath, privacyfilters/high.xpath redacted
    • Output 4: Kafka stream topic.stream.high
      • Content: xml-node with all content matching xpaths in privacyfilters/extreme.xpath, privacyfilters/high.xpath, privacyfilters/medium.xpath redacted
    • Output 5: Kafka stream topic.stream.extreme
      • Content: xml-node with all content matching xpaths in privacyfilters/extreme.xpath, privacyfilters/high.xpath, privacyfilters/medium.xpath, privacyfilters/low.xpath redacted
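The privacy filter is cumulative: each output stream redacts one more filter file on top of the previous stream's output, so topic.stream.extreme is the most heavily redacted. A sketch of that chaining, with illustrative XPath lists standing in for the real privacyfilters/*.xpath files shipped with NIAS:

```ruby
require 'rexml/document'

# Illustrative stand-ins for privacyfilters/extreme.xpath, high.xpath, etc.
FILTERS = {
  'extreme' => ['//LocalId'],
  'high'    => ['//FamilyName'],
  'medium'  => ['//GivenName'],
  'low'     => ['//YearLevel']
}.freeze

# Replace the text of every node matching the given XPaths.
def redact(xml, xpaths)
  doc = REXML::Document.new(xml)
  xpaths.each do |xp|
    REXML::XPath.each(doc, xp) { |el| el.text = 'ZZREDACTED' }
  end
  doc.to_s
end

# Stream suffix => filter file applied at that step. Note the inversion:
# the .low stream applies extreme.xpath, and the .extreme stream has had
# all four filter files applied.
def privacy_variants(xml)
  variants = { 'none' => xml }
  current = xml
  { 'low' => 'extreme', 'medium' => 'high',
    'high' => 'medium', 'extreme' => 'low' }.each do |stream, filter|
    current = redact(current, FILTERS[filter])
    variants[stream] = current
  end
  variants
end
```

Because each level reuses the previous level's output, a field redacted for the .low stream can never reappear in .medium, .high, or .extreme.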

SMS microservices

  • cons-prod-sif-parser.rb

    • Input: Kafka stream sifxml.processed
      • Format: SIF/XML object xml-node with header TOPIC: topic.stream m:n:doc-id
    • Output: Kafka stream sms.indexer
      • Content: JSON tuple used to represent graphable information about xml-node:
        • type => xml-node name
        • id => xml-node RefId
        • links => other GUIDs in xml-node
        • label => human-readable label of xml-node
        • otherids => any alternate identifiers in xml-node
  • cons-sms-indexer.rb

    • Input: Kafka stream sms.indexer
      • Format: JSON tuple used to represent graphable information about xml-node
    • Output: Redis database
      • Content: Sets representing the contribution of xml-node to the graph of nodes
  • cons-sms-storage.rb

    • Input: Kafka stream sifxml.processed
      • Format: SIF/XML object xml-node with header TOPIC: topic.stream m:n:doc-id
    • Output: Moneta Key/Value database
      • Content: Key: xml-node RefId; Value: xml-node
  • cons-sms-json-storage.rb

    • Input: Kafka stream json.storage
      • Format: JSON object json with header TOPIC: topic.stream m:n:doc-id
    • Output: Moneta Key/Value database
      • Content: Key: topic.stream::id (where id is extracted from record); Value: json
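The indexer's job is to fold the JSON tuples from sms.indexer into set structures. A sketch of that folding, with Redis stood in by in-memory Sets (the tuple keys — type, id, links, label, otherids — are those listed for cons-prod-sif-parser.rb above; the class and method names are illustrative):

```ruby
require 'set'

class GraphIndex
  def initialize
    @links  = Hash.new { |h, k| h[k] = Set.new }  # node id  -> linked GUIDs
    @types  = Hash.new { |h, k| h[k] = Set.new }  # node type -> member ids
    @labels = {}                                  # node id  -> human-readable label
  end

  # Fold one sms.indexer tuple into the graph sets.
  def index(tuple)
    @types[tuple[:type]] << tuple[:id]
    tuple[:links].each { |guid| @links[tuple[:id]] << guid }
    @labels[tuple[:id]] = tuple[:label]
    # otherids (alternate identifiers) would be indexed here too; omitted for brevity
  end

  def linked_to(id)
    @links[id]
  end

  def members(type)
    @types[type]
  end

  def label_of(id)
    @labels[id]
  end
end
```

Using sets keyed by node id and by type keeps the per-record contribution idempotent: re-indexing the same xml-node leaves the graph unchanged.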

NAPLAN microservices

  • cons-prod-csv2sif-staffpersonal-naplanreg-parser.rb

    • Input: Kafka stream naplan.csv_staff
      • Format: CSV following NAPLAN specification for staff records
    • Output1: Kafka stream sifxml.ingest
      • Content: TOPIC: naplan.sifxmlout_staff + CSV line {linenumber} + XML following NAPLAN specification for staff records
      • Constraint: CSV record has NAPLAN staff header LocalStaffId
      • Note: this payload will be validated as all other SIF payloads. If valid, it will end up on naplan.sifxmlout_staff.none
    • Output2: Kafka stream csv.errors
      • Content: Alert that wrong record type uploaded
      • Constraint: CSV record has NAPLAN student header LocalId
  • cons-prod-csv2sif-studentpersonal-naplanreg-parser.rb

    • Input: Kafka stream naplan.csv
      • Format: CSV following NAPLAN specification for student records
    • Output1: Kafka stream sifxml.ingest
      • Content: TOPIC: naplan.sifxmlout + CSV line {linenumber} + XML following NAPLAN specification for student records
      • Constraint: JSON of CSV validates against sms/services/naplan.student.json
      • Note: this payload will be validated as all other SIF payloads. If valid, it will end up on naplan.sifxmlout.none
    • Output2: Kafka stream csv.errors
      • Content: Alert that wrong record type uploaded
      • Constraint: CSV record has NAPLAN staff header LocalStaffId
  • cons-prod-sif2csv-staffpersonal-naplanreg-parser.rb

    • Input: Kafka stream naplan.sifxml_staff.none
      • Format: XML following NAPLAN specification for staff records
    • Output: Kafka stream naplan.csvstaff_out
      • Content: CSV following NAPLAN specification for staff records
  • cons-prod-sif2csv-studentpersonal-naplanreg-parser.rb

    • Input: Kafka stream naplan.sifxml.none
      • Format: XML following NAPLAN specification for student records
    • Output: Kafka stream naplan.csv_out
      • Content: CSV following NAPLAN specification for student records
  • cons-prod-studentpersonal-naplanreg-unique-ids-storage.rb

    • Input: Kafka stream sifxml.processed
      • Format: XML following NAPLAN specification for student records
    • Output1: Redis database
      • Content: Key: school+localid::(ASL School Id + Local Id) ; Value: XML RefId
      • Constraint: Key is unique in Redis
    • Output2: Redis database
      • Content: Key: school+name+dob::(ASL School Id + Given Name + Last Name + Date Of Birth) ; Value: XML RefId
      • Constraint: Key is unique in Redis
    • Output3: Kafka stream naplan.srm_errors
      • Content: Error report on duplicate key
      • Constraint: Key school+localid::(ASL School Id + Local Id) or key school+name+dob::(ASL School Id + Given Name + Last Name + Date Of Birth) is not unique in Redis, and xml has not been converted from a CSV file
    • Output4: Kafka stream naplan.srm_errors
      • Content: Error report on duplicate key, including the originating CSV line number
      • Constraint: Key school+localid::(ASL School Id + Local Id) or key school+name+dob::(ASL School Id + Given Name + Last Name + Date Of Birth) is not unique in Redis, and xml has been converted from a CSV file (the xml carries a comment recording the CSV line number)
  • cons-prod-naplan-studentpersonal-process-sif.rb

    • Input: Kafka stream sifxml.validated
      • Format: XML following NAPLAN specification for student records
    • Output: Kafka stream sifxml.processed
      • Content: XML following NAPLAN specification for student records + Platform Student Identifier (if not already supplied)
      • Constraint: XML source topic is naplan.sifxml or naplan.sifxmlout
  • cons-prod-sif2csv-SRM-validate.rb

    • Input: Kafka stream sifxml.processed
      • Format: XML following NAPLAN specification for student records
    • Output: Kafka stream naplan.srm_errors
      • Content: Any instances where XML violates the Student Registration Management system's validation rules
      • Constraint: XML source topic is naplan.sifxml or naplan.sifxml_staff
  • cons-prod-file-report.rb

    • Input: Kafka stream sifxml.processed
      • Format: XML following NAPLAN specification for student records
    • Output: Kafka stream naplan.filereport
      • Content: Report of number of schools, students, and student per year level, for each new doc-id seen (representing a new payload loaded into NIAS)
      • Constraint: XML source topic is naplan.sifxml or naplan.sifxml_staff
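The unique-ids storage service above hinges on deriving two keys per student record and treating an already-present key as a duplicate. A sketch of that check, with Redis stood in by a Hash (the field names and error text are illustrative assumptions; the key shapes come from the wiki):

```ruby
# Derive the two uniqueness keys listed for
# cons-prod-studentpersonal-naplanreg-unique-ids-storage.rb.
def unique_keys(rec)
  [
    "school+localid::#{rec[:asl_school_id]}+#{rec[:local_id]}",
    "school+name+dob::#{rec[:asl_school_id]}+#{rec[:given_name]}+#{rec[:family_name]}+#{rec[:dob]}"
  ]
end

# Store both keys against the record's RefId; a key already held by a
# different RefId yields an error report bound for naplan.srm_errors.
def store_unique_ids(store, rec)
  errors = []
  unique_keys(rec).each do |key|
    if store.key?(key) && store[key] != rec[:refid]
      errors << "duplicate key #{key} (already held by RefId #{store[key]})"
    else
      store[key] = rec[:refid]
    end
  end
  errors
end
```

Re-submitting the same record (same RefId) is therefore harmless; only a second record claiming the same school/local-id or school/name/date-of-birth combination triggers an error report.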