Domain Layers: Business Logic - i-on-project/integration GitHub Wiki

The Domain Layer contains all the business logic for interpreting the raw data obtained from source files, transforming it into coherent domain objects and, finally, generating the appropriate Data Transfer Objects (DTOs) for output processing. This module is further divided by each currently supported job type (class timetables, academic calendar, and evaluation schedule), as each of these requires specific parsing logic and object construction.

PDF Data Extraction

PDF data extraction is done using iText and Tabula. This activity is triggered in the job’s extractPDF step. iText is used through a call to the extract method in the ITextPdfExtractor class, and Tabula by invoking a custom extractor implemented through an object declaration. While iText outputs plain text, Tabula fetches the tables in the PDF while maintaining the column and row format, and returns an array of JSON objects, each representing a table. Each table will have the following name/value pairs:

{
   "extraction_method": "lattice",
   "top": 300.81226,
   "left": 56.926014,
   "width": 479.52911376953125,
   "height": 183.77505493164062,
   "right": 536.45514,
   "bottom": 484.5873,
   "data": []
}
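For illustration only, the table object above can be modeled with a pair of small Java records (the project itself is written in Kotlin; the type and field names here are hypothetical, chosen to follow the JSON shown above):

```java
import java.util.List;

public class TabulaModel {
    // One cell of a Tabula table, with its dimensions and text content.
    public record Cell(double top, double left, double width, double height, String text) {}

    // One table object as returned by Tabula: extraction method, table
    // dimensions, and the rows of cells carried in the "data" array.
    public record Table(String extractionMethod, double top, double left,
                        double width, double height, double right, double bottom,
                        List<List<Cell>> data) {}
}
```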

The extraction_method details which mode was used to retrieve the tables. Tabula allows for two approaches:

  • Lattice mode is to be used if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet.
  • Stream is the suggested mode if there are no ruling lines separating each cell.

The top, left, width, height, right, and bottom pairs refer to the table dimensions. The last pair, named data, is an array with the table rows, each one an array of cells:
[
   {
      "top": 300.81226,
      "left": 56.926014,
      "width": 247.92564392089844,
      "height": 15.077880859375,
      "text": "Divulgação de horários"
   },
   {
      "top": 300.81226,
      "left": 304.85165,
      "width": 231.60348510742188,
      "height": 15.077880859375,
      "text": "9 de setembro de 2020"
   }
]

Each cell will have the dimensions described in top, left, width, and height. The final pair, text, is the text content of the cell. Any cell with more than one line will have a text value containing the carriage return escape character \r, which will have to be trimmed later. The extracted information is made available as raw data, with a specific class per job type.
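One way to do that cleanup is a small helper along these lines (a sketch in Java rather than the project's Kotlin; the helper name is hypothetical):

```java
public class CellText {
    // Multi-line cells keep "\r" separators in their text value; replace
    // them with a space and trim surrounding whitespace before parsing.
    public static String clean(String raw) {
        return raw.replace("\r", " ").trim();
    }
}
```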

Business Objects

The Business Objects (BOs) are obtained from the raw data described previously. These contain the business logic, and each type of processed data has its own objects. Except for the timetable, whose internal model was defined in i-on Integration 2020, the academic calendar and evaluations have BOs closely mapped to the output formats described next.

After extraction from the PDF, the resulting raw data will contain text and a JSON object representing a table. The next transformation takes place in the Create Business Objects tasklet. Business Objects are internal representations of parsed data, distinct from the external representations and formats, which are modeled with the use of DTOs.

Timetables

Parsing ISEL timetable data was already a supported feature in the first iteration of the Integration project, and all table data was extracted by processing each page in its entirety through the Tabula library. Adapting the parser to the 2020-2021 timetable format proved challenging due to an unforeseen error that caused all instructor data, extracted from the lower half of the sheet, to be unordered. The exact cause is unclear, but it was possible to determine that the instructor data section was being interpreted incorrectly. Instead of the expected output, consisting of an ordered set of rows from top to bottom, each containing an ordered set of columns from left to right as shown below, the parsed result contained a set of cells ordered in a zig-zag pattern.

Expected parsing:

Expected instructor parsing

Incorrect parsing obtained:

Incorrect instructor parsing

Tabula’s API supports a set of parameters in its extract method that can be used to mark a specific area of the file for parsing. By making use of this feature, the area containing instructor data, as seen in the figure below, was parsed separately to obtain properly structured data without changing the result format.

Timetable parsing regions

Albeit counterintuitive, the only working solution found was to change the extraction_method parameter to stream which, despite being more appropriate for cells without ruling lines, managed to parse the instructor data regions correctly. Individual cell height in the timetable region also differs slightly from last year’s, increasing from about 15 points per cell to approximately 20. Each of these cells represents a specific 30-minute timeslot (e.g., from 8:30 to 9:00). This change is relevant due to the way class duration is calculated, which is based on aggregate cell height.
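The height-to-duration calculation can be pictured with a short sketch (Java rather than the project's Kotlin; the helper name is hypothetical and the 20-point constant is the approximate value mentioned above, not an exact figure):

```java
import java.time.Duration;

public class SlotDuration {
    // Approximate per-cell height in the 2020-2021 timetable PDFs
    // (roughly 20 points, up from about 15 the previous year).
    static final double CELL_HEIGHT_POINTS = 20.0;

    // Each cell is one 30-minute timeslot, so class duration follows from
    // the aggregate height of the merged cells spanning the class.
    public static Duration fromAggregateHeight(double heightPoints) {
        long slots = Math.round(heightPoints / CELL_HEIGHT_POINTS);
        return Duration.ofMinutes(slots * 30);
    }
}
```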

Cell height diagram

Other necessary changes included adapting Regular Expression (Regex) clauses to match this year’s file formats while maintaining compatibility with previous formats.

Academic Calendar

The file containing the Academic Calendar 2020/2021 for ISEL is composed of three tables: one for the winter term and two for the summer term, the latter two each on a different page of the PDF. The transformation into business objects starts with the invocation of the from static factory method in the AcademicCalendar data class. During the construction of the object, Regex is used to parse from RawCalendarData the various parts that compose the calendar:

  • Calendar Term states the academic year plus the calendar term.
  • Interruptions indicates the holiday and interruption dates.
  • Evaluations states the different exam periods.
  • Lectures provides the start and end dates of the lectures, associated with the curricular term. Different curricular terms may have different starting dates.
  • Other Events provides other dates, such as timetable release dates, the project final delivery date, and others.

To handle the different date formats present in the raw data, a DateUtils object was created to support the conversions using the Locale class from java.util, so that Portuguese is supported when parsing a string into a date. Locale definition follows IETF (Internet Engineering Task Force) BCP (Best Current Practice) 47 Language Tags, which use tags to indicate the language in use; for this calendar we use “pt”. This approach allows the use of other languages in the future, if needed. Date parsing required Regex to determine whether a string was a single date or a date range and, in the latter case, whether it spanned several years or months. As examples from the calendar we parsed, in Portuguese, the different date formats could be:

  • 9 de setembro de 2020
  • 1 de dezembro de 2020 a 3 de janeiro de 2021
  • 15 e 16 de fevereiro de 2021
  • 17 de fevereiro a 2 de março de 2021
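The locale-aware parsing described above can be sketched as follows (Java rather than the project's Kotlin; the class name, the date pattern, and the range-detection Regex are illustrative assumptions, not the project's actual DateUtils code):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class PtDates {
    // The BCP 47 tag "pt" selects Portuguese month names
    // ("setembro", "março", ...), matching the PDF's date strings.
    static final DateTimeFormatter PT_DATE =
            DateTimeFormatter.ofPattern("d 'de' MMMM 'de' yyyy", Locale.forLanguageTag("pt"));

    public static LocalDate parse(String text) {
        return LocalDate.parse(text, PT_DATE);
    }

    // Illustrative check only: a range joins its bounds with " a ",
    // while an enumeration of individual days uses " e ".
    public static boolean isRange(String text) {
        return text.matches(".*\\sa\\s.*");
    }
}
```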

Although we initially considered the Date class as the standard for all the dates obtained, we came to understand that Date is obsolete and should not be used: Date is mutable, so it is possible to circumvent an invariant of a piece of code that relies on it. As such, we decided to use LocalDate and ZonedDateTime, the former as a standard date representation without a time zone, the latter for the timestamps used in the JSON and YAML output files for the PDF creation date/time and retrieval date/time.
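The two roles can be illustrated with a minimal Java sketch (the helper names are hypothetical; the ISO-8601 rendering of the timestamp is an assumption about the output format):

```java
import java.time.LocalDate;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class DomainDates {
    // LocalDate is immutable: plusDays returns a new instance and the
    // argument is never modified, unlike java.util.Date's setters, which
    // let callers break invariants built on a shared Date object.
    public static LocalDate nextDay(LocalDate date) {
        return date.plusDays(1);
    }

    // Hypothetical rendering of a creation/retrieval timestamp for the
    // JSON and YAML output files.
    public static String timestamp(ZonedDateTime moment) {
        return moment.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```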

Exam Schedule

The exam schedule file for the scoped institution and the current calendar term is a multi-page PDF with three different tables. Due to the complexity of the implementation, we decided, in alignment with our supervisor, to focus on the exams and not include tests, nor the locations of the exams. The file is a PDF originating from Google Sheets, which kept the table structure valid, thus enabling Tabula to read its contents. Had the PDF been produced the way it was last year, we would not have been so fortunate: that file was a scanned document, which would have required Optical Character Recognition (OCR) for the text to be extracted.

As described earlier, Tabula produces a JSON object, which is converted, when the business objects are built, into a Table object in which each line is an array of Cell objects. As in the academic calendar, the from static factory method in the Evaluations data class triggers the construction of the Business Object from the RawEvaluationsData. Each line iteration checks which types of exams the course has in the table and adds an exam to a list. Taking into account whether the course is a winter or a summer course, the data is fetched from the appropriate columns and then used to create and add the exam events to the exam list in the Evaluations Business Object.
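The winter/summer column selection can be pictured with a sketch like the one below (Java rather than the project's Kotlin; the column indexes and class name are entirely hypothetical, since the real table layout is not reproduced here):

```java
import java.util.List;

public class ExamColumns {
    // Hypothetical positions: winter and summer courses keep their exam
    // dates in different column groups of the same table row.
    static final int WINTER_DATE_COLUMN = 2;
    static final int SUMMER_DATE_COLUMN = 5;

    // Pick the exam-date cell according to the course's term.
    public static String examDateCell(List<String> row, boolean winterCourse) {
        return row.get(winterCourse ? WINTER_DATE_COLUMN : SUMMER_DATE_COLUMN);
    }
}
```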

Data Transfer Objects

Data Transfer Objects are used to transfer data between layers and are produced from the Business Objects using static factory methods in the DTO data classes. While the implementation for the academic calendar and the evaluations was simple, for the timetable the construction of the DTOs required more effort, since the respective business objects were defined to be used with the i-on Core 2020 Write API, as shown below:

Timetable business objects

Using Kotlin’s collection operations, the result is a TimetableDto that mimics the output format described in the Integration Data Model section. These DTOs are wrapped in the ParsedData object used by the Dispatcher described in the Application Layer section and are used to produce the output file that is saved in the file repository.
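The BO-to-DTO static-factory pattern used throughout this layer can be reduced to a minimal sketch (Java rather than the project's Kotlin; the type names and the single field are illustrative, not the actual TimetableDto shape):

```java
public class DtoFactory {
    // A trimmed-down Business Object.
    public record CalendarBo(String calendarTerm) {}

    // The corresponding DTO exposes a static factory that maps from the BO,
    // mirroring the from(...) methods described above.
    public record CalendarDto(String calendarTerm) {
        public static CalendarDto from(CalendarBo bo) {
            return new CalendarDto(bo.calendarTerm());
        }
    }
}
```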