Caseflow Reader - department-of-veterans-affairs/caseflow GitHub Wiki

What is Reader

Reader is a product developed for attorneys and judges at the Board of Veteran Appeals. Caseflow Reader is NEVER available outside of the Board. Not even other components of VA, such as VBA or OGC, have access to Reader. Attorneys and judges need to review 100's or sometimes 1000's of documents for every decision they write. Reader provides a fast and powerful interface for reviewing these documents.

How to access Reader

Users need to have the "Reader" CSS role to access Reader. In our development mode all users have the "Reader" role, but BVASCASPER1 is the best user to test Reader with since they have several appeals to review. To open Reader go to the switch users page. Then proceed to the user's Queue. Either by going to /queue or clicking on the Queue link on the switch users page.

images/NavigateToQueue.png

From the user's Queue you can navigate to Reader for any of the cases assigned to them. These cases have different numbers and different types of documents to make it easy to test different scenarios.

images/ViewDocuments.png

Screens

Document List

images/ReaderDocumentList.png

The document list is the page users will land on first. It's a directory of all the available documents. The image above is numbered to point out some of the major features:

This accordion expands to show the user some basic information about the appeal such as the docket number, and what issues are on appeal. This feature may now be obviated by Queue's Case Details page.
Reader provides metadata search to allow users to quickly find the document they're looking for. Users can search by date, document type, comments, issue tags, or category.
The table allows users to both filter and sort by various columns to make case review as efficient as possible.
Each row in the table shows metadata of the document, as well as a link to the document viewer. The link is bolded, until clicked on. Once a user has viewed a document it becomes unbolded (like a G-Mail inbox). The most recently viewed document is denoted by a small triangle.
By clicking on the "Comments" button in this toggle users can see a list of all their comments on every document in the order of the comment's optional date field. Users can click on any of these comments to be taken to the document viewer for that document, then the viewer will automatically jump to the relevant comment.

Document View

images/ReaderDocumentView.png

The document view is what users see when they click on the link on the document list page. It renders a PDF as well as the metadata associated with this document. The image above is numbered to point out some of the major features:

The top toolbar is focused on manipulating the document. The + and - buttons zoom in and out respectively. The double arrow button will fit the PDF to the page, such that each page of the PDF fits exactly in the viewport. The circle arrow button will rotate the document by 90 degrees. The down arrow button will download the document to the users machine. The magnifying glass opens up a text search within the document.
The main viewing area renders the PDF. Standard navigation works in here. Scrolling, using the up/down arrow keys, or using PageUp and PageDown to jump between pages. Comments will also appear on the rendered document. Clicking on a comment will scroll to the comment text in the comment section (#8).
The bottom toolbar is focused on moving between documents. The Previous and Next buttons will move to the previous and next document within the list of documents. The page 1 of 4 shows the user what page they're currently on and allows them to jump to any page.
This accordion can be expanded to show basic document information like the document type, when it was uploaded, and a user editable document description.
All of these sections are accordions allowing users to expand or contract them to see more information on the screen.
Categories are a way for users to specify what type of document this is. Procedural, Medical, and Other Evidence are user editable while Case Summary is automatically selected based on the document type. Selecting a category here will then show up on the document list page and allow a user to filter or search for documents using that metadata.
1. If the Document Type matches one of the values in CASE_SUMMARY_TYPES the document it will have a category of Case Summary
Issue tags allow users to add any number of arbitrary text metadata to a document. The intended purpose for this field is to let users categorize documents based on what issue the Veteran is appeal. For example if a Veteran has an appeal for their left knee and right shoulder, some documents might apply to one, the other, or both. Users can classify each document based on what it describes and then using that metadata on the document list page to filter or search by. Note that in reality users utilize this feature to specify lots of different types of metadata, not just around what issue on appeal this document applies to.
Users can add a comment by clicking on the "Add a comment" button, then clicking on the part of the document they want to add the comment to. They'll then be prompted with a text field for the comment text, and an optional date field that they can use to tag the comment with the relevant date an event happened. Comments appear on the document list page, and are sorted by this date. Users can utilize this feature to build a timeline of the case. What happened when, and review all their comments in that order in a single place. Users can click on the text of a comment to jump the main viewing area (#2) to the comments location in the document.

Integrations

ArchitectureDiagrams/ReaderArchitecture.png

VBMS

VBMS is the ultimate source of truth of Veteran documents. Each Veteran has a claims folder with documents. Reader doesn't directly integrate with VBMS, instead, Reader pulls all of the documents from another Caseflow product, eFolder Express. eFolder Express acts as Caseflow's document service, offering a number of features that make it easier to integrate with than calling VBMS directly.

eFolder Express has an endpoint /api/v2/manifests that lists all of the documents in a given Veteran's Claims Folder. This endpoint is a polling endpoint, meaning that we start by POSTing to this endpoint which prompts eFolder Express to kick off a job which downloads a list of the documents from VBMS. Then we poll the endpoint with GET requests which returns the list of documents when the job is complete. We have this structure since for large claims folders VBMS can take a very long time to respond. Note that if eFolder Express sees that the most recent call to VBMS was less than 3 hours ago, it does not make another call to VBMS and instead uses the cached value. In this way we shield the user from some of VBMS's slower endpoints. The returned manifest has a list of documents along with their associated metadata. Each document has a series_id and a version_id (unfortunately we refer to version_id as vbms_document_id in most of the code). In VBMS a document may be uploaded with multiple versions. Each version of the document gets its own version_id, but will have the same series_id. Whenever we see a new document with the same series_id as an existing document, we copy over all the metadata (comments, tags, etc.) we'd associated with that first document.

We return the list of documents to the frontend which then makes calls directly to the /api/v2/records/:id endpoint on eFolder Express to retrieve the content of a document. Note that when the backend calls the manifest endpoint, eFolder Express kicks off a separate job to start download the content of all of the documents, so that when the frontend later makes requests for content, the document contents should already be cached in S3.

VVA

There is a second store of documents known as Virtual VA (or sometimes referred to as the Legacy Content Manager). eFolder Express integrates with VVA, so when we call the manifest endpoint we get VVA as well as VBMS documents. We can also retrieve VVA documents via the record endpoint.

Unfortunately, not all of the documents in VVA are PDFs. Some are instead TIFF files. Our frontend can only render PDFs so we've set up a small Docker service which uses ImageMagick to convert TIFF files into PDFs. There are a small set of TIFFs which currently cannot be converted like this that we can therefore not render.

VACOLS

Reader uses VACOLS to get metadata for legacy appeals. This data appears in the Claims Folder Details accordion as well as the Document Information accordion on the Document View page.

BGS

Reader uses BGS to get metadata for appeals. Just like the information pulled from VACOLS, the data appears in the Claims Folder Details and Document Information accordions.

Frontend

Reader is a frontend heavy application. We used two main libraries to build the PDF reader, PDFJS and React Virtualized.

PDFJS

PDFJS is an open source PDF renderer written in JavaScript. It's maintained by Mozilla. PDFJS has a complete PDF viewer solution, with a standard UI to go along with the PDF renderer. The PDF renderer packaged with FireFox is built using PDFJS. However, we decided to build out our own UI around the PDF rendering engine provided by PDFJS. There were two motivations for this:

We wanted the ability to add comments which would have required a lot of hacking to get working with the default PDFJS UI.
We wanted to render multiple PDFs at once, both the currently viewed PDF as well as the next and previous documents. This way transitioning between them would be smooth. Doing this with the full viewer felt complex.

We structured our classes using the same hierarchy that PDFJS provides. At the top level is PdfFile.jsx which corresponds to PdfDocument. In our componentDidMount we make the API request to retrieve the document and then pass the result to PDFJS to create a PdfDocument object. We set this document in the redux store. PdfFile.jsx renders the document in a Grid structure defined by React Virtualized. We'll dive more into what exactly this means soon, but for now assume that it acts like a table. Each row of this table is defined by PdfPage.jsx. PdfFile also asyncronusly gets the dimensions for all of the pages in the PDF and sticks the results into the Redux store.

PdfPage.jsx maps to PDFJS's PdfPage. In componentDidMount we call getPage on the document passed in. This returns a promise that when successful returns a PdfPage object. We can then use this PdfPage object to get at the data on the page.

The first method it provides is render. render takes a canvas context and a viewport object (retrieved by calling PdfPage's getViewport method with a scale), to render the page on the canvas. render returns a promise, that succeeds when the page is fully drawn. We use this to maintain the state of drawing the page. We store a boolean isDrawing. When true, we cannot try to render the page a second time. When the promise succeeds or fails, we set this variable to false. Similarly if the status of the page has changed during rendering, i.e. if the document has been zoomed while rendering, we need to draw it again at the new scale. This is checked in the then block of the promise.

The second method is getTextContent. This method returns a promise that on success returns the text content of the document. This can be passed to another PDFJS function renderTextLayer, which will render the text on a div. In all we render the PDF in the first layer, comments get rendered above that, and text gets rendered at the top level so that it can be selected.

Note that all of the actual rendering happens outside of the standard React lifecycle. This can lead to weird situations where the document is switched while rendering is happening. In this case, we need to clean up all the resources. In our componentWillUnmount we cleanup the page object, and cancel any ongoing rendering.

Another complex feature is how to deal with zooming and rotating the document. Zooming triggers a re-render since we need to draw everything at a new scale. When calling the drawPage method, we pass the new scale of the page into PDFJS's getViewport function, to define the scale we want to render at (Zoom 1 is standard, when zooming in we add .3 to the zoom. Zooming out subtracts .3 from the zoom). We call the getDivDimensions method on PdfPage to get the dimensions of a page. We derive this by grabbing the page dimension from the redux store, scaling it by the zoom factor (i.e. multiply width and height by the value in scale). Then we determine if the PDF is rotated either 90 or 270 degrees, in which case we swap the height and the width. i.e. rotated PDFs have heights for widths and vice versa. To rotate the PDF we use CSS transformations to quickly rotate the view.

React Virtualized

Originally we would render an entire document in the DOM at one time. However, this would lead to a major browser slow down for large documents. To address this, we started using the React Virtualized windowing library. React Virtualized provides a number of primitives that enable windowed rendering. We use the Grid primitive. Grids are 2D lists, i.e. you can zoom up and down as well as left and right. React Virtualized renders the cell that is currently visible, plus all cells in a small range around the currently visible cell. These rendered cells that are off screen are defined by the overscan count. The larger the overscan count, the smoother scrolling is, but the more that has to be rendered at one time. As pages that were rendered move out of the overscan section, the pages are unmounted, and new pages that enter the section are rendered.

As users zoom out, if we can fit more than one page across the width of the screen, we add extra columns to our grid. This requires a number of calculations. We need to tell the Grid how many columns and rows we have, and then when it asks us for a page to render in our getPage function, we need to find the right page based on the row and column.

We also need to provide the width and height of each row to enable React Virtualized to calculate how large of a scroll bar to make. We do this by asynchronously getting all the page dimensions when a PdfFile is mounted and then using these to provide React Virtualized with the size of each page.

Optimizations

We aim to empower our users to review documents as quickly as possible. Load times are both frustrating for users and hurt Veterans by making VA employees less efficient at their work. So making the application quick, is a worthwhile investment. Here we discuss two of the more intricate technical decisions we've made to improve Reader performance.

Rendering multiple documents

only occurs in Reader V1.0

On the document view page, we don't just render the current document, but also the next and previous documents. The advantage here is that flipping between documents becomes much smoother. They're already rendered, and the initial processing is already done. Note, this can make debugging the rendering flow tricky because adding a breakpoint in the main flow might get triggered for all of the rendered documents, even those that are invisible.

Caching documents in S3

The first time a user views a claims file in Caseflow (visits Reader's Document list page), the product fetches the file and its contents from VBMS and caches them in the system for 5 days. The Caseflow user can view those documents for the 5 days regardless of VBMS being out of service.

After 5 days from the initial claims file cache from VBMS, Caseflow waits on the next user attempt to view the claims file to cache its contents from VBMS again.

The RetrieveDocumentsForReaderJob job runs regularly, pulling the list of appeals assigned to any active users of Reader. This job calls eFolder to pull all the documents for those appeals, caching them in S3. When a user later tries to access one of these cases, they can download the documents very quickly.

More details can be found in the Reader backend wiki.