Reader Backend - TISTATechnologies/caseflow GitHub Wiki

Reader Backend

There are 2 pages for Caseflow Reader:

  • Document List page - shows a table of documents for the Veteran
  • Document View page - shows one document at a time

To help diagnose problems, the backend calls for each page are described below, followed by a list of recent known and resolved problems. Reader calls eFolder Express, which does the actual document retrieval from VBMS and VVA, so a section is dedicated to describing that aspect of eFolder.

Document List page

Upon load or page refresh, the Document List page makes 2 requests to the backend:

  1. Reader::AppealController#show returns info about the current appeal.
  2. Reader::DocumentsController#index returns an object with the following:
    • documents: a list of document records and associated document tags (aka "Issue Tags" in Reader). To get the documents, the controller calls appeal.document_fetcher.find_or_create_documents!
      • In this case, document_fetcher uses EFolderService for AMA and Legacy appeals (see Integrations). When document_fetcher is initialized, it
        • retrieves document metadata from eFolder Express with EFolderService.fetch_documents_for(appeal, user) (next section has further descriptions)
        • then sets manifest_vbms_fetched_at and manifest_vva_fetched_at (which are also sent back to the frontend)
      • When find_or_create_documents! is called, it ensures versions of Documents can be tracked by series_id (see "series_id and vbms_document_id" below).
    • annotations (aka document "Comments" in Reader)
      • also calls appeal.document_fetcher.find_or_create_documents!
    • manifestVbmsFetchedAt: timestamp indicating when documents were fetched from VBMS
    • manifestVvaFetchedAt: timestamp indicating when documents were fetched from VVA

series_id and vbms_document_id

  • A specific version of a document is referenced by its vbms_document_id.
  • All versions of the same document have the same series_id.
  • So there may be Document records that represent older versions of documents and are (correctly) not presented in the UI. As a result, Document.where(file_number: vet.file_number).count may be larger than the documents.size returned from Reader::DocumentsController.

From Reader's VBMS integration:

Each document has a series_id and a version_id (unfortunately we refer to version_id as vbms_document_id in most of the code). In VBMS a document may be uploaded with multiple versions. Each version of the document gets its own version_id, but will have the same series_id. Whenever we see a new document with the same series_id as an existing document, we copy over all the metadata (comments, tags, etc.) we'd associated with that first document.
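
The versioning behaviour described above can be sketched in plain Ruby, outside of Rails. This is only an illustration under stated assumptions: DocVersion, add_version, and latest_per_series are hypothetical names, not Caseflow code, and the sample IDs are made up. It shows metadata being carried over to a new version in the same series, and why the number of stored records can exceed the number of documents presented.

```ruby
# Hypothetical stand-in for Caseflow's Document model; only the fields
# discussed above are modelled.
DocVersion = Struct.new(:series_id, :vbms_document_id, :tags, :comments, keyword_init: true)

# Record a newly seen version. If a version with the same series_id is already
# known, copy its tags and comments onto the new version.
def add_version(known, series_id:, vbms_document_id:)
  previous = known.reverse.find { |d| d.series_id == series_id }
  doc = DocVersion.new(
    series_id: series_id,
    vbms_document_id: vbms_document_id,
    tags: previous ? previous.tags.dup : [],
    comments: previous ? previous.comments.dup : []
  )
  known << doc
  doc
end

# Keep only the most recently seen version of each series.
def latest_per_series(known)
  known.group_by(&:series_id).values.map(&:last)
end

known = []
v1 = add_version(known, series_id: "S1", vbms_document_id: "V1")
v1.tags << "Issue Tag A"
v1.comments << "Check page 3"
v2 = add_version(known, series_id: "S1", vbms_document_id: "V2")
add_version(known, series_id: "S2", vbms_document_id: "V3")

puts known.size                    # stored records, including the older version
puts latest_per_series(known).size # documents that would be presented
puts v2.tags.inspect               # tags carried over from V1
```

Here three records are stored but only two would be presented, which mirrors the count mismatch noted above.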

EfolderService.fetch_documents_for(appeal, user)

EfolderService is a client for the eFolder service (aka Caseflow eFolder Express, not to be confused with VBMS eFolder). EfolderService.fetch_documents_for is used by Reader to download documents from VBMS and VVA.

Document View page

From Reader's Document View:

[The frontend] makes calls directly to the /api/v2/records/:id endpoint on eFolder Express to retrieve the content of a document. [...] the document contents should already be cached in S3.

  1. With each document shown to the user, DocumentController#pdf is called for the current, next, and previous documents. (Note this is not the same Reader::DocumentsController used for the Document List page above.)
    • It serves up the pdf file from directory /tmp/pdfs/. The pdf could come from 3 places:

      Currently three levels of caching. Try to serve content from memory, then look to S3 if it's not in memory, and if it's not in S3 grab it from VBMS. Log where we get the file from for now for easy verification of S3 integration.

    • So if the document is not in S3 and comes from VVA, then Reader won't be able to show it. Should investigate a solution.
    • Can check in Rails logs for "File #{vbms_document_id} fetched from VBMS"
  2. DocumentController#mark_as_read updates DocumentView records to capture when the user views the document
  3. Reader::DocumentsController#show sets up the page content
  4. Reader::AppealController#show returns info about the current appeal
  5. Metrics::V1::HistogramController#create sends a histogram to DataDog about pdf_page_render_time_in_ms but values seem to always be 0: [{"group":"front_end","name":"pdf_page_render_time_in_ms","value":0,"app_name":"Reader","attrs":{"overscan":6,"document_type":"VA Memo","page_count":4}}, ...]
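
The three-level lookup quoted in step 1 (memory, then S3, then VBMS) can be sketched as an ordered list of sources that are tried in turn. This is a minimal plain-Ruby illustration, not the DocumentController implementation: fetch_pdf, the lambdas, and the document IDs are all hypothetical placeholders.

```ruby
# Try each source in order and return the first hit, together with the name of
# the source that served it (useful for logging where the file came from).
def fetch_pdf(document_id, sources)
  sources.each do |name, lookup|
    content = lookup.call(document_id)
    return { source: name, content: content } if content
  end
  nil
end

# Placeholder stores standing in for the in-memory cache, S3, and VBMS.
memory = { "doc-1" => "pdf bytes (memory)" }
s3     = { "doc-2" => "pdf bytes (S3)" }
vbms   = { "doc-3" => "pdf bytes (VBMS)" }

sources = {
  memory: ->(id) { memory[id] },
  s3:     ->(id) { s3[id] },
  vbms:   ->(id) { vbms[id] }
}

%w[doc-1 doc-2 doc-3 doc-4].each do |id|
  result = fetch_pdf(id, sources)
  puts "#{id}: #{result ? result[:source] : 'not found'}"
end
```

In this sketch doc-4 plays the role of a VVA document that is not in S3: none of the three sources can serve it, so nothing is returned.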

Documents cached in S3

Reader pulls document files from S3, if they're available. A RetrieveDocumentsForReaderJob caches documents in S3.

Concerns:

  • 5 minutes may be too frequent. Could the same 5 users be chosen by consecutive jobs if the first job is still processing? Since efolder_documents_fetched_at is not set until a job finishes, if the first job takes longer than 5 minutes (e.g., 1000+ documents) then the next job would pick the same users. Should investigate improvements to this.
  • How often is S3 used compared to document retrievals from VBMS/VVA? The intent of the job is to retrieve preferably all documents from S3. Should measure how well this job is achieving this intent and improve it, while considering S3 file auto-deletions.
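
The first concern can be illustrated with a small simulation. It is only a sketch under assumptions: the selection rule in pick_users (never-fetched users first, then least recently fetched) is a guess and not taken from RetrieveDocumentsForReaderJob, and User here is a plain Struct rather than the Rails model. Times are in minutes.

```ruby
User = Struct.new(:id, :efolder_documents_fetched_at)

# Assumed selection rule: users with no timestamp come first, then the users
# with the oldest timestamp.
def pick_users(users, limit: 5)
  users.sort_by do |u|
    [u.efolder_documents_fetched_at ? 1 : 0, u.efolder_documents_fetched_at.to_i, u.id]
  end.first(limit)
end

# Return the user ids picked by both of two consecutive runs.
def overlap_between_runs(job_duration_minutes, interval_minutes: 5)
  users = (1..10).map { |i| User.new(i, nil) }
  first_batch = pick_users(users)

  # Timestamps are only written when the job finishes, so they are visible to
  # the next run only if the first job completed within the interval.
  if job_duration_minutes <= interval_minutes
    first_batch.each { |u| u.efolder_documents_fetched_at = job_duration_minutes }
  end

  second_batch = pick_users(users)
  first_batch.map(&:id) & second_batch.map(&:id)
end

puts overlap_between_runs(2).inspect   # first job finishes before the next run
puts overlap_between_runs(12).inspect  # first job is still running at the next run
```

Under these assumptions a job that outlasts the interval leads the next run to select exactly the same users, which is the duplication the concern describes.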

When are these files auto-deleted in S3?

Asked Tango: Slack convo

Some digging reveals this:

bucket = Caseflow::S3Service.init!
client = Aws::S3::Client.new
# List the lifecycle rules configured on the bucket
resp = client.get_bucket_lifecycle({ bucket: bucket.name })
pp resp.rules.pluck(:id, :prefix)
[["delete form 8s after 5 days", "form_8 "],
 ["delete documents after 5 days", "documents"]]

The earliest file in the S3 documents folder is from 5 days ago (the AWS S3 web UI shows the folder contents), so Reader documents are indeed deleted after 5 days.

Doc counts in Reader

In the Reader UI, document counts are displayed to the user. They can be simulated as follows:

appeal = Appeal.find_by(uuid: ...)
docs = Document.where(file_number: appeal.veteran_file_number)

# Document List page
page1resp = ExternalApi::EfolderService.document_count(appeal.veteran_file_number, user)

# Document View page
page2resp = ExternalApi::EfolderService.fetch_documents_for(appeal, user)
page2resp[:documents].size

These document counts can change over time. For example,

  • 2 Document records were created and retrieved but are no longer retrievable by eFolder, possibly because new versions are available.
  • eFolder has 1 new Record that Reader doesn't yet know about, possibly because a new document was uploaded to VBMS/VVA.
  • The net document count change may be 1, but there are 3 differences. Should investigate a better way to track documents.

Some code for further investigation:

docs = Document.where(file_number: appeal.veteran_file_number)
vbms_idsD = docs.pluck(:vbms_document_id) # IDs known to Reader's database

df = appeal.document_fetcher # takes many seconds to complete
df.number_of_documents
df.documents.group_by { |d| d.upload_date.beginning_of_day }.map { |k, v| [k, v.size] }.sort
df.documents.group_by { |d| d.received_at.beginning_of_day }.map { |k, v| [k, v.size] }.sort
vbms_idsR = df.documents.pluck(:vbms_document_id) # IDs currently returned by eFolder

vbms_idsD - vbms_idsR
=> ["{2605FFFC-C9C7-4EF8-BAB8-E1042CB7A92F}", 
    "{50FB8137-8D01-431E-B71D-55F8A6BC7F09}"]
vbms_idsR - vbms_idsD
=> ["{2B507FFF-0CF2-41DA-92A4-0394D3BBF52A}"]
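
The two subtractions above can be wrapped into one summary that reports the net change and the number of differences side by side. This is a plain-Ruby sketch: compare_ids is a hypothetical helper, and the single-letter IDs are placeholders rather than real vbms_document_id values.

```ruby
require "set"

# Compare the IDs stored by Reader with the IDs currently returned by eFolder.
def compare_ids(reader_ids, efolder_ids)
  reader = reader_ids.to_set
  efolder = efolder_ids.to_set
  only_reader = (reader - efolder).to_a
  only_efolder = (efolder - reader).to_a
  {
    only_in_reader: only_reader,
    only_in_efolder: only_efolder,
    net_change: efolder.size - reader.size,
    differences: only_reader.size + only_efolder.size
  }
end

reader_ids = %w[A B C D]  # placeholder IDs known to Reader
efolder_ids = %w[A B E]   # placeholder IDs returned by eFolder

summary = compare_ids(reader_ids, efolder_ids)
puts summary.inspect
```

With two records gone and one added, the count moves by only one while three records differ, so comparing counts alone understates how much has changed.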

Doc counts in Queue page

Document counts are shown in the table on some Queue pages.

AppealsController#document_count provides these document counts. It calls EFolderService.document_count(appeal.veteran_file_number, current_user), which:

  • checks Rails.cache "Efolder-document-count-#{file_number}"
  • checks Rails.cache "Efolder-document-count-bgjob-#{file_number}" (expires_in: 15.minutes)
  • starts background FetchEfolderDocumentCountJob, which checks Rails.cache "Efolder-document-count-#{file_number}" (expires_in: 4.hours) and sends GET request to /api/v2/document_counts
    • In response, eFolder Express api/v2/document_counts#index checks its cache veteran-doc-count-#{file_number} (expires_in: 2.hours) and responds with DocumentCounter.new(veteran_file_number: file_number)
      • which calls v2_fetch_documents_for(file_number) for both VBMSService and VVAService (same as ManifestFetcher mentioned in the context of Reader's Document List page), and then returns a set of document_ids, which is counted.
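
The cache checks on the Caseflow side can be sketched with a small in-memory cache that supports expiry. This is an illustration under assumptions, not the EFolderService code: TtlCache stands in for Rails.cache, run_fetch_job stands in for FetchEfolderDocumentCountJob, the file number and count are made up, and it is assumed that the 15-minute bgjob key exists to avoid enqueuing duplicate jobs and that nil is returned until a count is cached. Only the cache key names and expiry times come from the list above.

```ruby
# Minimal cache with per-key expiry; times are plain integers in seconds.
class TtlCache
  def initialize
    @store = {}
  end

  def read(key, now:)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || now >= expires_at

    value
  end

  def write(key, value, now:, expires_in:)
    @store[key] = [value, now + expires_in]
  end
end

MINUTE = 60
HOUR = 60 * MINUTE

# Return the cached count, or enqueue a background fetch at most once per
# 15 minutes. `enqueued` collects the file numbers a job was requested for.
def document_count(cache, file_number, now:, enqueued:)
  count = cache.read("Efolder-document-count-#{file_number}", now: now)
  return count if count

  bgjob_key = "Efolder-document-count-bgjob-#{file_number}"
  unless cache.read(bgjob_key, now: now)
    cache.write(bgjob_key, true, now: now, expires_in: 15 * MINUTE)
    enqueued << file_number
  end
  nil # assumption: no count is available until the job has run
end

# Stand-in for the background job storing the count it fetched.
def run_fetch_job(cache, file_number, fetched_count, now:)
  cache.write("Efolder-document-count-#{file_number}", fetched_count, now: now, expires_in: 4 * HOUR)
end

cache = TtlCache.new
enqueued = []
first = document_count(cache, "123", now: 0, enqueued: enqueued)
second = document_count(cache, "123", now: 60, enqueued: enqueued)
run_fetch_job(cache, "123", 42, now: 120)
third = document_count(cache, "123", now: 180, enqueued: enqueued)

p [first, second, third]
p enqueued
```

In this sketch the first two calls return nothing and only one job is enqueued; once the job has written the count, later calls are served from the cache until the 4-hour entry expires.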

Known Problems

  1. PDF version of a TIFF from VVA is not shown because the TIFF (not the PDF) is in S3 and cannot be immediately retrieved like a VBMS-sourced file. #14193
    • Which documents fail conversion?
  2. Why do document counts change over time? e.g., 421 + 5 more: #14289
    • Why 425 vs 426? 2 gone + 1 added; VBMS's response changes over time
    • 6/2/2020: Now 440. docs.pluck(:series_id).uniq.size => 440
    • Need to better synchronize documents with VBMS/VVA.
  3. Is the same job submitted within the same time span? "active user" check and limited to 5 at a time
  4. [Should no longer be a problem] Document count numbers are not the same in Queue and Reader (Related resolved ticket due to VBMS pagination)