Reader Backend - TISTATechnologies/caseflow GitHub Wiki
Reader Backend
There are 2 pages for Caseflow Reader:
- Document List page - shows a table of documents for the Veteran
- Document View page - shows one document at a time
To help diagnose problems, the backend calls for each page are described below, followed by a list of recent known and resolved problems. Reader calls eFolder Express, which does the actual document retrieval from VBMS and VVA, so a section is dedicated to describing that aspect of eFolder.
Document List page
Upon load or page refresh, the Document List page makes 2 requests to the backend:
- Reader::AppealController#show returns info about the current appeal.
- Reader::DocumentsController#index returns an object with the following:
  - documents: a list of document records and associated document tags (aka "Issue Tags" in Reader). To get the documents, the controller calls appeal.document_fetcher.find_or_create_documents!
    - In this case, document_fetcher uses EFolderService for AMA and Legacy appeals (see Integrations). Upon document_fetcher initiation, it
      - retrieves document metadata from eFolder Express with EFolderService.fetch_documents_for(appeal, user) (next section has further descriptions)
      - then sets manifest_vbms_fetched_at and manifest_vva_fetched_at (which are also sent back to the frontend)
    - Upon find_or_create_documents! being called, it ensures versions of Documents can be tracked by series_id as follows:
      - it calls DocumentSeriesIdAssigner to ensure all known Documents have a series_id
      - and merges fetched documents with known Document records or, if unknown, creates a new Document (copying annotations/comments from the previous doc with the same series_id)
  - annotations (aka document "Comments" in Reader)
    - also calls appeal.document_fetcher.find_or_create_documents!
  - manifestVbmsFetchedAt: timestamp indicating when documents were fetched from VBMS
  - manifestVvaFetchedAt: timestamp indicating when documents were fetched from VVA
series_id and vbms_document_id
- A specific version of a document is referenced by its vbms_document_id.
- All versions of the same document have the same series_id.
- So there may be Document records that represent older versions of documents that are (correctly) not presented in the UI. As a result, Document.where(file_number: vet.file_number).count is not equal to documents.size returned from Reader::DocumentsController.
From Reader's VBMS integration:
Each document has a series_id and a version_id (unfortunately we refer to version_id as vbms_document_id in most of the code). In VBMS a document may be uploaded with multiple versions. Each version of the document gets its own version_id, but will have the same series_id. Whenever we see a new document with the same series_id as an existing document, we copy over all the metadata (comments, tags, etc.) we'd associated with that first document.
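The version handling described above can be illustrated with a small sketch. This is not Caseflow's implementation: the Doc struct and the merge_fetched helper are hypothetical stand-ins, and only the rule itself (a new version inherits the metadata of the earlier document with the same series_id) comes from the text.

```ruby
# Hypothetical stand-in for Caseflow's Document model; fields are illustrative.
Doc = Struct.new(:vbms_document_id, :series_id, :comments, keyword_init: true)

# Merge freshly fetched documents into the known set, keyed by series_id.
# A fetched document whose version is already known is reused as-is; a new
# version inherits the comments attached to the previous version in its series.
def merge_fetched(known, fetched)
  latest_by_series = known.group_by(&:series_id).transform_values(&:last)

  fetched.map do |doc|
    previous = latest_by_series[doc.series_id]
    if previous.nil?
      doc                                   # brand-new document
    elsif previous.vbms_document_id == doc.vbms_document_id
      previous                              # same version, nothing to copy
    else
      doc.comments = previous.comments.dup  # new version: carry metadata over
      doc
    end
  end
end

known = [Doc.new(vbms_document_id: "v1", series_id: "s1", comments: ["check page 3"])]
fetched = [
  Doc.new(vbms_document_id: "v2", series_id: "s1", comments: []),
  Doc.new(vbms_document_id: "v9", series_id: "s2", comments: [])
]

merged = merge_fetched(known, fetched)
merged.each { |d| puts "#{d.vbms_document_id} (#{d.series_id}): #{d.comments.inspect}" }
```

Here v2 picks up the comment written against v1 because both share series s1, while v9 starts with no comments.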
EfolderService.fetch_documents_for(appeal, user)
EfolderService is a client for the eFolder service (aka Caseflow eFolder Express, not to be confused with VBMS eFolder). EfolderService.fetch_documents_for is used by Reader to download documents from VBMS and VVA.
- First it sends a POST request to /api/v2/manifests (see Reader access to VBMS)
  - In response to the POST request, eFolder Express (specifically Api::V2::ManifestsController#start) creates a Manifest (and a corresponding FilesDownload per current_user) for the Veteran. A Manifest typically has 2 ManifestSources, one each for VBMS and VVA.
    - Schema diagram of relevant eFolder Express classes
  - It starts to retrieve documents for each ManifestSource using a high_priority V2::DownloadManifestJob parameterized by the current_user. V2::DownloadManifestJob does the following:
    - uses ManifestFetcher to fetch a list of documents for all the "file numbers" known for the veteran using BGS info. The actual document-list fetching is done by calling v2_fetch_documents_for(file_number) on VBMSService and VVAService. A DocumentCreator is used to delete and recreate all Records associated with the manifest_source, after applying DocumentFilter.
    - then it starts a low_priority V2::SaveFilesInS3Job to retrieve the documents' contents and store them as files in S3: manifest_source.records.each(&:fetch!)
      - A Record corresponds to a Document to be retrieved by RecordFetcher, which will fetch the contents from S3 before trying VBMS/VVA, convert images to PDF files if needed, and save to S3.
      - If conversion to PDF fails, the image is saved to S3 (however Reader can only show PDFs) and no alert is logged. Should investigate a solution and at least log the error when record.conversion_status==conversion_failed.
- Once all documents for the appeal are fetched, EfolderService sends a GET request to /api/v2/manifests/#{manifest_id} to return the retrieved documents.
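The POST-then-GET sequence can be sketched as a start-and-poll loop. This is a sketch under assumptions, not the actual client: FakeEfolder, fetch_documents, the "pending"/"success" status values, and the poll limit are all hypothetical, and no HTTP request is made. Only the two endpoints come from the description above.

```ruby
# Hypothetical in-memory stand-in for eFolder Express; no HTTP is performed.
class FakeEfolder
  def initialize(polls_until_done:)
    @remaining = polls_until_done
  end

  # Stands in for POST /api/v2/manifests: create the manifest, start fetching.
  def post_manifest(file_number)
    { id: 1, file_number: file_number, sources: sources("pending"), records: [] }
  end

  # Stands in for GET /api/v2/manifests/:id: report the current state.
  def get_manifest(id)
    @remaining -= 1
    done = @remaining <= 0
    { id: id,
      sources: sources(done ? "success" : "pending"),
      records: done ? [{ id: 10, type: "VA Memo" }] : [] }
  end

  private

  # One source per backing system, as a Manifest typically has.
  def sources(status)
    [{ name: "VBMS", status: status }, { name: "VVA", status: status }]
  end
end

# Start the manifest, then poll until no source is pending (assumed behavior).
def fetch_documents(client, file_number, max_polls: 5)
  manifest = client.post_manifest(file_number)
  polls = 0
  while manifest[:sources].any? { |s| s[:status] == "pending" }
    raise "manifest #{manifest[:id]} still pending after #{max_polls} polls" if polls >= max_polls

    manifest = client.get_manifest(manifest[:id])
    polls += 1
  end
  manifest[:records]
end

records = fetch_documents(FakeEfolder.new(polls_until_done: 3), "123456789")
puts records.inspect
```

Bounding the number of polls keeps a slow manifest (for example, a Veteran with a very large folder) from blocking the caller indefinitely; how the real client handles that case is not covered here.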
Document View page
From Reader's Document View:
[The frontend] makes calls directly to the /api/v2/records/:id endpoint on eFolder Express to retrieve the content of a document. [...] the document contents should already be cached in S3.
- With each document shown to the user, DocumentController#pdf is called for the current, next, and previous documents. (Note this is not the same Reader::DocumentsController used for the Document List page above.)
  - It serves up the pdf file from directory /tmp/pdfs/. The pdf could come from 3 places:
    Currently three levels of caching. Try to serve content from memory, then look to S3 if it's not in memory, and if it's not in S3 grab it from VBMS. Log where we get the file from for now for easy verification of S3 integration.
    - So if the document is not in S3 and comes from VVA, then Reader won't be able to show it. Should investigate a solution.
    - Can check in Rails logs for "File #{vbms_document_id} fetched from VBMS"
- DocumentController#mark_as_read updates DocumentView records to capture when the user views the document
- Reader::DocumentsController#show sets up the page content
- Reader::AppealController#show returns info about the current appeal
- Metrics::V1::HistogramController#create sends a histogram to DataDog about pdf_page_render_time_in_ms but values seem to always be 0: [{"group":"front_end","name":"pdf_page_render_time_in_ms","value":0,"app_name":"Reader","attrs":{"overscan":6,"document_type":"VA Memo","page_count":4}}, ...]
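The three levels of caching used by DocumentController#pdf can be sketched as a tiered lookup. This is illustrative only: TieredDocumentStore is a hypothetical class, plain Hashes stand in for the local cache, S3, and VBMS, and back-filling the faster tiers is an assumption. The "fetched from VBMS" log wording follows the message quoted above; the other messages are made up for the sketch.

```ruby
# Hypothetical tiers: plain Hashes stand in for the local cache, S3, and VBMS.
class TieredDocumentStore
  attr_reader :log

  def initialize(local:, s3:, vbms:)
    @local = local
    @s3 = s3
    @vbms = vbms
    @log = []
  end

  # Return the content for a document id, trying the cheapest tier first and
  # (as an assumption) filling the faster tiers on the way back.
  def fetch(id)
    if @local.key?(id)
      @log << "File #{id} fetched from local cache"
      return @local[id]
    end

    if @s3.key?(id)
      @log << "File #{id} fetched from S3"
      @local[id] = @s3[id]
      return @local[id]
    end

    if @vbms.key?(id)
      @log << "File #{id} fetched from VBMS"
      @s3[id] = @vbms[id]
      @local[id] = @vbms[id]
      return @local[id]
    end

    # There is no VVA tier, so a VVA document missing from S3 ends up here.
    @log << "File #{id} not available"
    nil
  end
end

store = TieredDocumentStore.new(local: {}, s3: { "a" => "pdf-a" }, vbms: { "b" => "pdf-b" })
store.fetch("a")         # served from S3, then cached locally
store.fetch("a")         # served from the local cache
store.fetch("b")         # served from VBMS
store.fetch("vva-only")  # not in any tier
puts store.log
```

The last call shows the gap noted above: with no VVA tier, a VVA-sourced document that never reached S3 cannot be served.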
Documents cached in S3
Reader pulls document files from S3, if they're available. A RetrieveDocumentsForReaderJob caches documents in S3:
- According to serverless.yml, this job runs every 5 minutes for active Reader users.
- This job chooses up to 5 users who (1) logged in within the last week and (2) either have never fetched documents through eFolder or have not done so within the last day.
- For the Legacy and AMA appeals these users are assigned to, the job calls appeal.document_fetcher.find_or_create_documents!, the same call as on Reader's Document List page.
Concerns:
- 5 minutes may be too frequent. Could the same 5 users be chosen by consecutive jobs if the first job is still processing? Since efolder_documents_fetched_at is not set until a job finishes, if the first job takes longer than 5 minutes (e.g., 1000+ documents) then the next job would pick the same users. Should investigate improvements to this.
- How often is S3 used compared to document retrievals from VBMS/VVA? The intent of the job is to retrieve preferably all documents from S3. Should measure how well this job is achieving this intent and improve it, while considering S3 file auto-deletions.
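The first concern can be reproduced with a small simulation. This is a sketch under assumptions: the User struct and users_needing_fetch are hypothetical, and the selection rule is a simplified reading of the criteria listed above (login within a week, no eFolder fetch within a day, at most 5 users), not the job's actual query.

```ruby
# Hypothetical stand-in for the user record; only the two timestamps matter here.
User = Struct.new(:name, :last_login_at, :efolder_documents_fetched_at, keyword_init: true)

DAY = 24 * 60 * 60 # seconds

# Pick up to `limit` users who logged in within the last week and whose
# documents were never fetched, or were fetched more than a day ago.
def users_needing_fetch(users, now:, limit: 5)
  users.select do |u|
    logged_in_recently = u.last_login_at > now - 7 * DAY
    fetched = u.efolder_documents_fetched_at
    logged_in_recently && (fetched.nil? || fetched < now - DAY)
  end.first(limit)
end

now = Time.at(1_600_000_000)
users = [
  User.new(name: "a", last_login_at: now - DAY,      efolder_documents_fetched_at: nil),
  User.new(name: "b", last_login_at: now - 2 * DAY,  efolder_documents_fetched_at: now - 3 * DAY),
  User.new(name: "c", last_login_at: now - 10 * DAY, efolder_documents_fetched_at: nil),
  User.new(name: "d", last_login_at: now - DAY,      efolder_documents_fetched_at: now - 3600)
]

# Run 1 starts at t = 0 and is still working 5 minutes later.
first_run = users_needing_fetch(users, now: now)
# Run 2 starts at t = 5 min; timestamps are unchanged, so it picks the same users.
second_run = users_needing_fetch(users, now: now + 300)

# Run 1 finishes at t = 10 min and only now records the fetch time.
first_run.each { |u| u.efolder_documents_fetched_at = now + 600 }
third_run = users_needing_fetch(users, now: now + 600)

puts "run 1: #{first_run.map(&:name)}, run 2: #{second_run.map(&:name)}, run 3: #{third_run.map(&:name)}"
```

One possible improvement, offered only as a suggestion, is to record a separate "fetch started" marker when a user is picked so that an overlapping run skips users already in progress.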
When are these files auto-deleted in S3?
Asked Tango: Slack convo
Some digging reveals this:
bucket = Caseflow::S3Service.init!
client = Aws::S3::Client.new
resp = client.get_bucket_lifecycle({bucket: bucket.name})
pp resp.rules.pluck(:id, :prefix)
# Output:
[["delete form 8s after 5 days", "form_8 "],
 ["delete documents after 5 days", "documents"]]
The earliest file in the S3 documents folder is from 5 days ago (per the folder contents shown in the AWS S3 web UI), so Reader documents are indeed deleted after 5 days.
Doc counts in Reader
In the Reader UI, document counts are displayed to the user. They can be simulated as follows:
appeal=Appeal.find_by(uuid: ...)
docs=Document.where(file_number: appeal.veteran_file_number)
# Document List page
page1resp=ExternalApi::EfolderService.document_count(appeal.veteran_file_number,user)
# Document View page
page2resp=ExternalApi::EfolderService.fetch_documents_for(appeal,user)
page2resp[:documents].size
These document counts can change over time. For example,
- 2 Document records were created and retrieved but are no longer retrievable by eFolder, possibly because new versions are available.
- eFolder has 1 new Record that Reader doesn't yet know about, possibly because a new document was uploaded to VBMS/VVA.
- The net document count change may be 1, but there are 3 differences. Should investigate a better way to track documents.
Some code for further investigation:
docs=Document.where(file_number: appeal.veteran_file_number)
vbms_idsD=docs.pluck(:vbms_document_id)
df=appeal.document_fetcher # takes many seconds to complete
df.number_of_documents
df.documents.group_by{|d| d.upload_date.beginning_of_day}.map{|k,v| [k,v.size]}.sort
df.documents.group_by{|d| d.received_at.beginning_of_day}.map{|k,v| [k,v.size]}.sort
vbms_idsR=df.documents.pluck(:vbms_document_id)
vbms_idsD - vbms_idsR
=> ["{2605FFFC-C9C7-4EF8-BAB8-E1042CB7A92F}",
"{50FB8137-8D01-431E-B71D-55F8A6BC7F09}"]
vbms_idsR - vbms_idsD
=> ["{2B507FFF-0CF2-41DA-92A4-0394D3BBF52A}"]
Doc counts in Queue page
Document counts are shown in the table on some Queue pages.
AppealsController#document_count provides these document counts. It calls EFolderService.document_count(appeal.veteran_file_number, current_user), which:
- checks Rails.cache "Efolder-document-count-#{file_number}"
- checks Rails.cache "Efolder-document-count-bgjob-#{file_number}" (expires_in: 15.minutes)
- starts background FetchEfolderDocumentCountJob, which checks Rails.cache "Efolder-document-count-#{file_number}" (expires_in: 4.hours) and sends GET request to /api/v2/document_counts
  - In response, eFolder Express api/v2/document_counts#index checks its cache veteran-doc-count-#{file_number} (expires_in: 2.hours) and responds with DocumentCounter.new(veteran_file_number: file_number)
    - which calls v2_fetch_documents_for(file_number) for both VBMSService and VVAService (same as ManifestFetcher mentioned in the context of Reader's Document List page), and then returns a set of document_ids, which is counted.
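The Caseflow-side caching steps above can be sketched with an in-memory cache. This is a sketch under assumptions: TtlCache is a hypothetical stand-in for Rails.cache, document_count and run_count_job are simplified helpers, and returning nil on a cache miss is an assumption. The cache keys and expiry times are the ones listed above.

```ruby
# Minimal in-memory cache with per-key expiry; a stand-in for Rails.cache.
class TtlCache
  def initialize(clock)
    @clock = clock
    @store = {}
  end

  def read(key)
    value, expires_at = @store[key]
    return nil if expires_at.nil? || @clock.call >= expires_at

    value
  end

  def write(key, value, expires_in:)
    @store[key] = [value, @clock.call + expires_in]
  end
end

# Return the cached count, or enqueue a background fetch (at most once per
# 15 minutes per file number) and return nil (assumed miss behavior).
def document_count(cache, file_number, enqueue:)
  count = cache.read("Efolder-document-count-#{file_number}")
  return count if count

  job_key = "Efolder-document-count-bgjob-#{file_number}"
  unless cache.read(job_key)
    cache.write(job_key, true, expires_in: 15 * 60)
    enqueue.call(file_number)
  end
  nil
end

# What the background job does once eFolder Express has answered.
def run_count_job(cache, file_number, remote_count)
  cache.write("Efolder-document-count-#{file_number}", remote_count, expires_in: 4 * 60 * 60)
end

now = 0
cache = TtlCache.new(-> { now })
queued = []
enqueue = ->(file_number) { queued << file_number }

first  = document_count(cache, "123", enqueue: enqueue) # miss, job enqueued
second = document_count(cache, "123", enqueue: enqueue) # miss, job already flagged
run_count_job(cache, "123", 440)
third  = document_count(cache, "123", enqueue: enqueue) # hit
now += 4 * 60 * 60
fourth = document_count(cache, "123", enqueue: enqueue) # expired, enqueued again

puts [first, second, third, fourth].inspect
puts "jobs enqueued: #{queued.size}"
```

The separate bgjob key is what keeps repeated page loads from enqueuing a new job on every request while the first one is still running.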
Known Problems
- PDF version of a TIFF from VVA is not shown because the TIFF (not the PDF) is in S3 and cannot be immediately retrieved like a VBMS-sourced file. #14193
- Which documents fail conversion?
- Why do document counts change over time? e.g., 421 + 5 more: #14289
- Why 425 vs 426? 2 gone + 1 added; VBMS's response changes over time
- 6/2/2020: Now 440. docs.pluck(:series_id).uniq.size => 440
- Added details at #14081 Investigation Part-3
- Need to better synchronize documents with VBMS/VVA.
- Is the same job submitted within the same time span? "active user" check and limited to 5 at a time
- [Should no longer be a problem] Document count numbers are not the same in Queue and Reader (Related resolved ticket due to VBMS pagination)