Reader Backend - department-of-veterans-affairs/caseflow GitHub Wiki

Reader Backend

There are 2 pages for Caseflow Reader:

Document List page - shows a table of documents for the Veteran
Document View page - shows one document at a time

To help diagnose problems, the backend calls for each page are described below, followed by a list of recent known and resolved problems. Reader calls eFolder Express, which does the actual document retrieval from VBMS and VVA, so a section is dedicated to describing that aspect of eFolder.

Document List page

Upon load or page refresh, the Document List page makes 2 requests to the backend:

Reader::AppealController#show returns info about the current appeal.
Reader::DocumentsController#index returns an object with the following:
- documents: a list of document records and associated document tags (aka "Issue Tags" in Reader). To get the documents, the controller calls appeal.document_fetcher.find_or_create_documents!
  - In this case, document_fetcher uses EFolderService for AMA and Legacy appeals (see Integrations). Upon document_fetcher initialization, it
    - retrieves document metadata from eFolder Express with EFolderService.fetch_documents_for(appeal, user) (next section has further descriptions)
    - then sets manifest_vbms_fetched_at and manifest_vva_fetched_at (which are also sent back to the frontend)
  - Upon find_or_create_documents! being called, it ensures versions of Documents can be tracked by series_id as follows:
    - it calls DocumentSeriesIdAssigner to ensure all known Documents have a series_id
    - and merges fetched documents with known Document records or, if unknown, creates a new Document (copying annotations/comments from previous doc with the same series_id)
- annotations (aka document "Comments" in Reader)
  - also calls appeal.document_fetcher.find_or_create_documents!
- manifestVbmsFetchedAt: timestamp indicating when documents were fetched from VBMS
- manifestVvaFetchedAt: timestamp indicating when documents were fetched from VVA

`series_id` and `vbms_document_id`

A specific version of a document is referenced by its vbms_document_id.
All versions of the same document have the same series_id.
So there may be Document records that represent older versions of documents and are (correctly) not presented in the UI. As a result, Document.where(file_number: vet.file_number).count can be larger than documents.size returned from Reader::DocumentsController.

From Reader's VBMS integration:

Each document has a series_id and a version_id (unfortunately we refer to version_id as vbms_document_id in most of the code). In VBMS a document may be uploaded with multiple versions. Each version of the document gets its own version_id, but will have the same series_id. Whenever we see a new document with the same series_id as an existing document, we copy over all the metadata (comments, tags, etc.) we'd associated with that first document.
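The version handling described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: the `Doc` struct and `merge_by_series` helper are hypothetical stand-ins, not Caseflow's Document model or DocumentSeriesIdAssigner, and only comments are copied here.

```ruby
# Illustrative only: a plain Struct stands in for Caseflow's Document model.
Doc = Struct.new(:vbms_document_id, :series_id, :comments, keyword_init: true)

# Merge freshly fetched documents with already-known ones (hypothetical helper).
# known:   documents already stored, possibly including older versions
# fetched: documents just returned by the document service
def merge_by_series(known, fetched)
  known_by_version = known.to_h { |d| [d.vbms_document_id, d] }
  known_by_series  = known.group_by(&:series_id)

  fetched.map do |doc|
    existing = known_by_version[doc.vbms_document_id]
    next existing if existing # this exact version is already known

    # New version: copy comments from the last stored doc in the same series.
    previous = known_by_series.fetch(doc.series_id, []).last
    Doc.new(
      vbms_document_id: doc.vbms_document_id,
      series_id: doc.series_id,
      comments: previous ? previous.comments.dup : []
    )
  end
end

known = [Doc.new(vbms_document_id: "v1", series_id: "s1", comments: ["check date"])]
fetched = [
  Doc.new(vbms_document_id: "v2", series_id: "s1", comments: []),
  Doc.new(vbms_document_id: "v9", series_id: "s2", comments: [])
]

merged = merge_by_series(known, fetched)
merged.each { |d| puts "#{d.vbms_document_id} (#{d.series_id}): #{d.comments.inspect}" }
```

In this toy data, "v2" inherits the comment from "v1" because both share series "s1", while "v1" itself stays stored but is not part of the merged list. That is the same reason the stored record count can exceed the count shown in the UI.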

`EfolderService.fetch_documents_for(appeal, user)`

EfolderService is a client for the eFolder service (aka Caseflow eFolder Express, not to be confused with VBMS eFolder). EfolderService.fetch_documents_for is used by Reader to download documents from VBMS and VVA.

First it sends a POST request to /api/v2/manifests (see Reader access to VBMS)
- In response to the POST request, eFolder Express (specifically Api::V2::ManifestsController#start) creates a Manifest (and a corresponding FilesDownload per current_user) for the Veteran. A Manifest typically has 2 ManifestSources -- one for each of VBMS and VVA.
  - Schema diagram of relevant eFolder Express classes
  - It starts to retrieve documents for each ManifestSource using a high_priority V2::DownloadManifestJob parameterized by the current_user. V2::DownloadManifestJob does the following:
    - uses ManifestFetcher to fetch a list of documents for all the "file numbers" known for the veteran using BGS info. The actual document-list fetching is done by calling v2_fetch_documents_for(file_number) on VBMSService and VVAService. A DocumentCreator is used to delete and recreate all Records associated with the manifest_source, after applying DocumentFilter.
    - then it starts a low_priority V2::SaveFilesInS3Job to retrieve the documents' contents and store them as files in S3: manifest_source.records.each(&:fetch!)
      - A Record corresponds to a Document to be retrieved by RecordFetcher, which will fetch the contents from S3 before trying VBMS/VVA, convert images to PDF files if needed, and save to S3.
      - If conversion to PDF fails, the original image is saved to S3 instead (even though Reader can only show PDFs) and no alert is logged. Should investigate a solution and at least log the error when record.conversion_status==conversion_failed.
Once all documents for the appeal are fetched, EfolderService sends a GET request to /api/v2/manifests/#{manifest_id} to return the retrieved documents.
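The request sequence above (POST to start the manifest, then GET to read the result) can be sketched as follows. This is a rough illustration, not the EfolderService implementation: `FakeEfolderApi`, the `status` values, the manifest id, and the retry loop are all assumptions made so the example runs without a network.

```ruby
# Hypothetical stand-in for the eFolder Express API. It reports the manifest
# as "pending" for the first two status checks, then "success".
class FakeEfolderApi
  def initialize
    @checks = 0
  end

  # Stands in for POST /api/v2/manifests
  def start_manifest(file_number)
    { id: 42, file_number: file_number, status: "pending", records: [] }
  end

  # Stands in for GET /api/v2/manifests/:id
  def get_manifest(id)
    @checks += 1
    return { id: id, status: "pending", records: [] } if @checks < 3

    { id: id, status: "success", records: [{ id: 1 }, { id: 2 }] }
  end
end

# Start the manifest, then re-check it until it is ready or attempts run out.
def fetch_documents(api, file_number, max_attempts: 5, wait: 0)
  manifest = api.start_manifest(file_number)
  max_attempts.times do
    manifest = api.get_manifest(manifest[:id])
    return manifest[:records] if manifest[:status] == "success"

    sleep(wait)
  end
  raise "manifest #{manifest[:id]} not ready after #{max_attempts} attempts"
end

records = fetch_documents(FakeEfolderApi.new, "123456789")
puts records.size
```

Bounding the number of attempts keeps a slow VBMS or VVA source from blocking the request indefinitely. The limit and wait time used here are placeholders.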

Document View page

From Reader's Document View:

[The frontend] makes calls directly to the /api/v2/records/:id endpoint on eFolder Express to retrieve the content of a document. [...] the document contents should already be cached in S3.

With each document shown to the user, DocumentController#pdf is called for the current, next, and previous documents. (Note this is not the same Reader::DocumentsController used for the Document List page above.)
- It serves up the pdf file from directory /tmp/pdfs/. The pdf could come from 3 places:
  
  Currently three levels of caching. Try to serve content from memory, then look to S3 if it's not in memory, and if it's not in S3 grab it from VBMS. Log where we get the file from for now for easy verification of S3 integration.
- So if the document is not in S3 and comes from VVA, then Reader won't be able to show it. Should investigate a solution.
- Can check in Rails logs for "File #{vbms_document_id} fetched from VBMS"
DocumentController#mark_as_read updates DocumentView records to capture when the user views the document
Reader::DocumentsController#show sets up the page content
Reader::AppealController#show returns info about the current appeal
Metrics::V1::HistogramController#create sends a histogram to DataDog about pdf_page_render_time_in_ms but values seem to always be 0: [{"group":"front_end","name":"pdf_page_render_time_in_ms","value":0,"app_name":"Reader","attrs":{"overscan":6,"document_type":"VA Memo","page_count":4}}, ...]
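The three-level lookup quoted for DocumentController#pdf can be sketched like this. It is a simplified model under assumptions: hashes stand in for memory and S3, a lambda stands in for VBMS, and writing the result back to S3 and memory is a guess about the behavior. Only the "fetched from VBMS" log text comes from this page; the other log messages are invented for the example.

```ruby
# Hypothetical fetcher: Hashes stand in for memory and S3, a callable for VBMS.
def fetch_pdf(doc_id, memory:, s3:, vbms:, log: [])
  if memory.key?(doc_id)
    log << "File #{doc_id} fetched from memory"
    return memory[doc_id]
  end

  if s3.key?(doc_id)
    log << "File #{doc_id} fetched from S3"
    return memory[doc_id] = s3[doc_id]
  end

  # Last resort. Returns nil when VBMS does not have the document,
  # which models a VVA-only document that was never cached in S3.
  content = vbms.call(doc_id)
  return nil if content.nil?

  log << "File #{doc_id} fetched from VBMS"
  s3[doc_id] = content      # assumed write-back so later requests hit the cache
  memory[doc_id] = content
end

memory = {}
s3 = { "doc-a" => "%PDF a" }
vbms = ->(id) { id == "doc-b" ? "%PDF b" : nil }
log = []

fetch_pdf("doc-a", memory: memory, s3: s3, vbms: vbms, log: log)
fetch_pdf("doc-b", memory: memory, s3: s3, vbms: vbms, log: log)
fetch_pdf("doc-a", memory: memory, s3: s3, vbms: vbms, log: log)
missing = fetch_pdf("vva-only", memory: memory, s3: s3, vbms: vbms, log: log)
puts log
```

The last call returns nil and logs nothing, which mirrors the gap noted above for documents that are not in S3 and come from VVA.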

Documents cached in S3

Reader pulls document files from S3, if they're available. A RetrieveDocumentsForReaderJob caches documents in S3:

According to serverless.yml, this job runs every 5 minutes for active Reader users.
This job chooses up to 5 users who (1) logged in within the last week and (2) either have never used eFolder to fetch documents or have not done so within the last day.
For the Legacy and AMA appeals these users are assigned to, the job calls appeal.document_fetcher.find_or_create_documents! -- same as on Reader's Document List page.
https://github.com/department-of-veterans-affairs/caseflow-efolder/blob/master/app/services/record_fetcher.rb

Concerns:

5 minutes may be too frequent. Could the same 5 users be chosen by consecutive jobs if the first job is still processing? Since efolder_documents_fetched_at is not set until a job finishes, if the first job takes longer than 5 minutes (e.g., 1000+ documents) then the next job would pick the same users. Should investigate improvements to this.
How often is S3 used compared to document retrievals from VBMS/VVA? The intent of the job is to retrieve preferably all documents from S3. Should measure how well this job is achieving this intent and improve it, while considering S3 file auto-deletions.
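The selection rule and the overlap concern can be reproduced with a small model. This is a sketch under assumptions: the `User` struct, the `last_login_at` field name, and `pick_users` are hypothetical. Only efolder_documents_fetched_at and the criteria (one week, one day, up to 5 users) come from the description above.

```ruby
# Illustrative model of the user selection; not the job's actual query.
User = Struct.new(:name, :last_login_at, :efolder_documents_fetched_at, keyword_init: true)

DAY = 24 * 60 * 60 # seconds

def pick_users(users, now:, limit: 5)
  users.select do |u|
    logged_in_recently = u.last_login_at && u.last_login_at >= now - 7 * DAY
    fetched = u.efolder_documents_fetched_at
    logged_in_recently && (fetched.nil? || fetched < now - DAY)
  end.first(limit)
end

now = Time.utc(2020, 6, 2, 12, 0, 0)
users = [
  User.new(name: "a", last_login_at: now - DAY, efolder_documents_fetched_at: nil),
  User.new(name: "b", last_login_at: now - 2 * DAY, efolder_documents_fetched_at: now - 3 * DAY),
  User.new(name: "c", last_login_at: now - 10 * DAY, efolder_documents_fetched_at: nil),
  User.new(name: "d", last_login_at: now - DAY, efolder_documents_fetched_at: now - 3600)
]

first_run = pick_users(users, now: now).map(&:name)
# Five minutes later the first job is still running, so no timestamp has changed.
second_run = pick_users(users, now: now + 300).map(&:name)

puts first_run.inspect
puts second_run.inspect
```

Both runs select the same users because nothing marks them as in progress. One possible mitigation, not verified against the real job, would be to record a "started at" marker before fetching.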

When are these files auto-deleted in S3?

Asked Tango: Slack convo

Some digging reveals this:

# Rails console: list the lifecycle rules configured on the Caseflow S3 bucket
bucket = Caseflow::S3Service.init!
client = Aws::S3::Client.new
resp = client.get_bucket_lifecycle(bucket: bucket.name)
pp resp.rules.pluck(:id, :prefix)
[["delete form 8s after 5 days", "form_8 "],
 ["delete documents after 5 days", "documents"]]

The oldest file in the S3 documents folder is from 5 days ago (the AWS S3 web UI shows folder contents), so Reader documents are indeed deleted after 5 days.

Doc counts in Reader

In the Reader UI, document counts are displayed to the user. The counts can be reproduced in a Rails console as follows:

appeal = Appeal.find_by(uuid: ...)
docs = Document.where(file_number: appeal.veteran_file_number)

# Document List page
page1resp = ExternalApi::EfolderService.document_count(appeal.veteran_file_number, user)

# Document View page
page2resp = ExternalApi::EfolderService.fetch_documents_for(appeal, user)
page2resp[:documents].size

These document counts can change over time. For example,

2 Document records were created and retrieved but are no longer retrievable by eFolder, possibly because new versions are available.
eFolder has 1 new Record that Reader doesn't yet know about, possibly because a new document was uploaded to VBMS/VVA.
The net change in document count is only 1 (2 gone, 1 added), but there are 3 differences. Should investigate a better way to track documents.

Some code for further investigation:

docs=Document.where(file_number: appeal.veteran_file_number)
vbms_idsD=docs.pluck(:vbms_document_id)

df=appeal.document_fetcher # takes many seconds to complete
df.number_of_documents
df.documents.group_by{|d| d.upload_date.beginning_of_day}.map{|k,v| [k,v.size]}.sort
df.documents.group_by{|d| d.received_at.beginning_of_day}.map{|k,v| [k,v.size]}.sort
vbms_idsR=df.documents.pluck(:vbms_document_id)

vbms_idsD - vbms_idsR
=> ["{2605FFFC-C9C7-4EF8-BAB8-E1042CB7A92F}", 
    "{50FB8137-8D01-431E-B71D-55F8A6BC7F09}"]
vbms_idsR - vbms_idsD
=> ["{2B507FFF-0CF2-41DA-92A4-0394D3BBF52A}"]
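The comparison above can be wrapped in a small helper that reports both the net change and the total number of differences. This is a sketch with made-up IDs, and `compare_ids` is a hypothetical helper name. In a console, the two arrays would be vbms_idsD and vbms_idsR.

```ruby
# Summarize how two lists of document IDs differ (hypothetical helper).
def compare_ids(stored_ids, fetched_ids)
  only_stored  = stored_ids - fetched_ids   # known to Reader, no longer returned
  only_fetched = fetched_ids - stored_ids   # returned by eFolder, unknown to Reader
  {
    only_stored: only_stored,
    only_fetched: only_fetched,
    net_change: fetched_ids.size - stored_ids.size,
    total_differences: only_stored.size + only_fetched.size
  }
end

# Made-up IDs: 2 documents disappeared and 1 appeared.
stored_ids  = %w[A B C D]
fetched_ids = %w[A B E]

summary = compare_ids(stored_ids, fetched_ids)
puts summary.inspect
```

With this data the net change is -1 while the total number of differences is 3, which is why comparing counts alone can hide churn in the underlying documents.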

Doc counts in Queue page

Document counts are shown in the table on some Queue pages.

AppealsController#document_count provides these document counts. It calls EFolderService.document_count(appeal.veteran_file_number, current_user), which:

checks Rails.cache "Efolder-document-count-#{file_number}"
checks Rails.cache "Efolder-document-count-bgjob-#{file_number}" (expires_in: 15.minutes)
starts background FetchEfolderDocumentCountJob, which checks Rails.cache "Efolder-document-count-#{file_number}" (expires_in: 4.hours) and sends GET request to /api/v2/document_counts
- In response, eFolder Express api/v2/document_counts#index checks its cache veteran-doc-count-#{file_number} (expires_in: 2.hours) and responds with DocumentCounter.new(veteran_file_number: file_number)
  - which calls v2_fetch_documents_for(file_number) for both VBMSService and VVAService (same as ManifestFetcher mentioned in the context of Reader's Document List page), and then returns a set of document_ids, which is counted.
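The layered cache checks above can be modeled with a tiny in-memory cache. This is a sketch under assumptions: `TinyCache` is a stand-in for Rails.cache with explicit timestamps, and treating the "bgjob" key as a guard against enqueueing duplicate jobs is an interpretation of the steps listed, not confirmed behavior. The key names and expiry times are taken from this page.

```ruby
# Minimal in-memory cache with expiry; a stand-in for Rails.cache.
class TinyCache
  def initialize
    @store = {}
  end

  def read(key, now:)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || now >= expires_at

    value
  end

  def write(key, value, expires_in:, now:)
    @store[key] = [value, now + expires_in]
  end
end

# Return a cached count if present; otherwise enqueue at most one job
# per 15-minute window (assumed purpose of the bgjob key) and return nil.
def document_count(cache, file_number, now:, job_queue:)
  count = cache.read("Efolder-document-count-#{file_number}", now: now)
  return count if count

  bgjob_key = "Efolder-document-count-bgjob-#{file_number}"
  unless cache.read(bgjob_key, now: now)
    cache.write(bgjob_key, true, expires_in: 15 * 60, now: now)
    job_queue << file_number
  end
  nil
end

cache = TinyCache.new
queue = []
t0 = Time.utc(2020, 6, 2, 9, 0, 0)

first  = document_count(cache, "123", now: t0, job_queue: queue)
second = document_count(cache, "123", now: t0 + 60, job_queue: queue)
# The background job finishes and stores the count for 4 hours.
cache.write("Efolder-document-count-123", 426, expires_in: 4 * 60 * 60, now: t0 + 120)
third = document_count(cache, "123", now: t0 + 180, job_queue: queue)

puts [first.inspect, second.inspect, third.inspect, queue.inspect].join(" | ")
```

Passing `now` explicitly keeps the example deterministic. With Rails.cache the expiry would be handled by the cache store itself.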

Known Problems

PDF version of a TIFF from VVA is not shown b/c the TIFF (not the PDF) is in S3 and cannot be immediately retrieved like a VBMS-sourced file. #14193
- Which documents fail conversion?
Why do document counts change over time? e.g., 421 + 5 more: #14289
- Why 425 vs 426? 2 gone + 1 added; VBMS's response changes over time
- 6/2/2020: Now 440. docs.pluck(:series_id).uniq.size => 440
  - Added details at #14081 Investigation Part-3
- Need to better synchronize documents with VBMS/VVA.
Is the same job submitted within the same time span? "active user" check and limited to 5 at a time
[Should no longer be a problem] Document count numbers are not the same in Queue and Reader (Related resolved ticket due to VBMS pagination)