Growth Spurt Components - cs428TAs/w2024 GitHub Wiki

NOTE TO READER

Because this project is part of a research lab, we have limited the team to the four members currently signed up. If you still find this project utterly fascinating (this would be remarkably flattering), email me at [email protected] and we can discuss possible arrangements.

Growth Spurt Overview

The BYU Record Linking Lab specializes in finding computational methods of maintaining and growing the Family Tree on familysearch.org. Growth Spurt is an ongoing project that aims to grow the Family Tree by tracking record-person pairings that are suspected to be related (referred to as "hints" or "record hints"). When one of these hints has been verified (or "attached") via an automated or manual review process, we update the Family Tree to reflect this. FamilySearch then generates new hints from that new information.

HarvesterGS

Once we have verified that every person on a record has been reviewed and attached to that record, Growth Spurt (via HarvesterGS) looks at the Family Tree profile of each person found on the record and "harvests" new hints from FamilySearch.

ShakerGS

In addition to implementing this cycle, our project also includes the creation of a component called ShakerGS. This component will perform two functions: (1) automatically decide whether to attach a hint and (2) assign a difficulty level to a hint if it could not be automatically attached. Results from the former function will go in a different database from results from the latter function.
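To make the two ShakerGS functions concrete, here is a minimal sketch of the triage step. Everything specific in it is an assumption made for illustration: the `Hint` fields, the numeric `score`, the thresholds, and the three difficulty levels are hypothetical, and the two lists merely stand in for the two separate databases. The actual decision logic is not described on this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class Difficulty(Enum):
    # Hypothetical difficulty levels; the real scale is not specified here.
    EASY = 1
    MEDIUM = 2
    HARD = 3


@dataclass(frozen=True)
class Hint:
    record_id: str   # hypothetical identifier of the record
    person_id: str   # hypothetical identifier of the Family Tree person
    score: float     # hypothetical match confidence in [0, 1]


@dataclass
class ShakerOutput:
    # Stand-ins for the two separate result stores.
    attached: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)


def difficulty_for(score):
    """Map a score to a difficulty level (assumed cut-offs)."""
    if score >= 0.75:
        return Difficulty.EASY
    if score >= 0.5:
        return Difficulty.MEDIUM
    return Difficulty.HARD


def shake(hints, attach_threshold=0.95):
    """Auto-attach confident hints; grade the rest for later review."""
    out = ShakerOutput()
    for hint in hints:
        if hint.score >= attach_threshold:
            out.attached.append(hint)
        else:
            out.needs_review.append((hint, difficulty_for(hint.score)))
    return out
```

Keeping the attach decision and the difficulty grading in separate functions mirrors the two responsibilities described above, so either rule could be replaced without touching the other.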

TL;DR

In other words, Growth Spurt uses attached hints to locate brand new hints (HarvesterGS) and then attempts to attach the new hints (ShakerGS), thereby creating a self-sustaining cycle of hint attachment and hint discovery. Through this process, we hope to grow the Family Tree safely and efficiently.

The Project (HarvesterGS and ShakerGS)

Our project is to implement the HarvesterGS and ShakerGS components of Growth Spurt for a subset of hint types. While what we describe here represents the core functionality of the broader project, Growth Spurt contains significant extended functionality that we have not addressed here and that falls outside the realistic scope of a semester-long project.

Status Reports

PERT and GANTT

PERT

GrowthSpurt PERT Chart

Gantt

gantt
    dateFormat  YYYY-MM-DD
    title       Growth Spurt Components Gantt Chart
    %% excludes    weekends

    section Milestone 1
    Repo Setup  :done,  2024-01-29,2024-01-30
    Terraform    :done,  2024-01-30,2024-02-12
    Build Lambdas  :done,  2024-02-09, 2024-02-16
    Setup ECS  :active,  2024-02-09, 2024-03-25
    Integrate into Milestone 1  :done, 2024-02-16, 2024-03-05
    Milestone 1  :milestone, 2024-03-29,  0d

    section Milestone 2
    Set up FamilySearch in CI/CD  :done, 2024-02-15, 2024-03-16
    Find PIDs  :done, 2024-03-05, 2024-03-09
    Find ARKs  :done, 2024-03-05, 2024-03-09
    Integrate into Milestone 2  :done, 2024-03-09, 2024-03-22
    Milestone 2  :milestone, 2024-03-25,  0d

    section Milestone 3
    Serializer Uploader  :2024-03-25, 2024-04-05
    Connect Serializer  :2024-03-25, 2024-04-05
    ShakerGS prototype  :2024-03-25, 2024-04-05
    Integrate into Milestone 3  :2024-04-05, 2024-04-15
    Milestone 3  :milestone, 2024-04-15,  0d

Requirements

Milestones

Milestone 1 - Prototype

Running version of Harvester with no implemented functionality, with the goal of having "at every stage in the process, a working system" (MMM, Ch. 16). It simply takes in a hint from an S3 bucket, and outputs the same hint to a different bucket.
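A minimal sketch of this pass-through step is shown below. It assumes an S3 client object exposing `get_object` and `put_object` with the same keyword arguments as the boto3 S3 client; the bucket names, the key, and the `FakeS3` stand-in are hypothetical and exist only so the sketch can run without AWS access.

```python
import io


def passthrough(s3, src_bucket, dst_bucket, key):
    """Copy one hint object, unchanged, from src_bucket to dst_bucket."""
    body = s3.get_object(Bucket=src_bucket, Key=key)["Body"].read()
    s3.put_object(Bucket=dst_bucket, Key=key, Body=body)
    return len(body)  # number of bytes copied


class FakeS3:
    """In-memory stand-in for an S3 client (illustration only)."""

    def __init__(self, objects=None):
        # Maps (bucket, key) -> bytes.
        self.objects = dict(objects or {})

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body


if __name__ == "__main__":
    s3 = FakeS3({("hints-in", "hint-1.json"): b'{"hint": 1}'})
    passthrough(s3, "hints-in", "hints-out", "hint-1.json")
    print(s3.objects[("hints-out", "hint-1.json")])
```

Passing the client in as an argument keeps the function testable locally; in a deployed version, a real client (for example one created with `boto3.client("s3")`) could be supplied instead.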

Milestone 2 - Find hints

Harvester is now able to take in a hint, identify the people listed on the hint, identify the record hints associated with those people, and retrieve those hints. It is still going from one S3 bucket to another. In theory we can use existing lab tools to do this, but one major known unknown is the reliability of these tools.
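The lookup chain described above (record, then people, then their hints) might be sketched as follows. The two lookup callables are hypothetical placeholders for whatever lab tools end up being used, and the identifiers are invented. Skipping the record that was just attached and dropping duplicate pairs are also assumptions, not documented behavior.

```python
def harvest(record_id, people_on_record, hints_for_person):
    """Collect new (person, record) hint pairs reachable from one record.

    people_on_record: callable, record id -> iterable of person ids
    hints_for_person: callable, person id -> iterable of hinted record ids
    Both callables are hypothetical stand-ins for the lab's lookup tools.
    """
    seen = set()
    new_hints = []
    for person_id in people_on_record(record_id):
        for hinted_record in hints_for_person(person_id):
            pair = (person_id, hinted_record)
            # Assumption: ignore the record we started from and duplicates.
            if hinted_record == record_id or pair in seen:
                continue
            seen.add(pair)
            new_hints.append(pair)
    return new_hints


if __name__ == "__main__":
    # Tiny in-memory example with invented identifiers.
    people = {"rec-A": ["person-1", "person-2"]}
    hints = {
        "person-1": ["rec-A", "rec-B"],
        "person-2": ["rec-B", "rec-C", "rec-C"],
    }
    result = harvest("rec-A", lambda r: people.get(r, []), lambda p: hints.get(p, []))
    print(result)
```

Because the lookups are injected, an unreliable tool can be swapped out or wrapped with retries without changing the harvesting loop itself.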

Milestone 3 - Other Components

Integrate Harvester's inputs and outputs. The input bucket is populated by another module called Shaker, which gets its hints from a module called Serializer. Serializer is being built by another team, but we will need to build a stubbed-out prototype for Shaker (similar to M1 for Harvester) and then hook it up to Serializer and Harvester. Finally, Harvester outputs hints into an S3 bucket to be processed by other projects. Once those hints are complete, they will be fed back to Serializer.

What we are not building

  • The full functionality for Shaker will need to be built, but not during this semester.
  • Eventually the system will need to handle many types of hints. For now, it will only handle record attachment hints.

Identify hard problems

  • Retrieving person information from a record and record information from a person
  • Integrating different components created by different teams in the lab
  • Relying on the functionality of Serializer, which is out of the control of our team and is being developed in parallel.

Architecture & Design

Architecture and Design Document

Test Plan

Growth Spurt Test Plan

Diagram

This project includes the two marked parts of the below Growth Spurt diagram:

GrowthSpurt

Team

flowchart BT;
  pm["<strong>Project Manager</strong><br> Zarin Loosli"];
  ca["<strong>Chief Architect</strong><br> Hunter Corry"];
  a["<strong>Architect</strong><br> Ben Giles"];
  dm["<strong>Development Manager</strong><br> Sam Carlsen"];
  tm["<strong>Testing Manager</strong><br> Ben Giles"];
  d["<strong>Developer</strong><br> Zarin Loosli<br> Hunter Corry<br> Ben Giles"];
  t["<strong>Tester</strong><br> Hunter Corry<br> Sam Carlsen<br> Zarin Loosli"];
  %% ca <--> pm;
  %% pm --- ca;
  dm --> pm;
  dm --> ca;
  tm --> dm;
  a --> ca;
  d --> dm;
  t --> tm;

Role descriptions

Project Manager: Primary interface between the TAs and the team. Responsible for keeping track of assignments and deadlines.

Chief Architect: Primary source of vision and design for the project. Defines project structure and interfaces to be implemented by the Development Manager's team.

Development Manager: Primarily responsible for implementation of the Chief Architect's design. Reports to the Project Manager about development progress and the Chief Architect about potential issues with the design. Works with the Testing Manager to coordinate testing of developed code.

Developer: Receives assignments from the Development Manager to implement specific features, and carries them out.

Testing Manager: Primarily responsible for the quality assurance of the final project. Ensures that the final product has complete test coverage.

Tester: Receives assignments from the Testing Manager to implement specific tests, and carries them out.
