Process and known limits - PaulBreugnot/TheMaterialParser GitHub Wiki

TheMaterialParser lets you semi-automate the extraction and standardization of material compositions. However, both the tool and its parsing process still need significant improvement.

Extraction process

As described in the manual, TheMaterialParser lets you apply multiple selections to several datasheets to try to extract compositions.

More precisely, for each datasheet :

The Tabula BasicExtractionAlgorithm is applied with all the selections.
For each parsed .csv, our algorithm tries to parse a component from each entry (vertical or horizontal, depending on the specified orientation).
A composition is considered as found if all the potential components are valid. If multiple selections produce valid compositions, the one with the most valid components is kept.
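The selection rule above can be sketched as follows. This is a minimal illustration, not the code of processSelection: the Component struct and the valid_composition? and best_composition helpers are hypothetical names, and treating an empty result as invalid is an assumption.

```ruby
# Hypothetical stand-in for a parsed component: a symbol and a validity flag.
Component = Struct.new(:symbol, :valid)

# A composition is accepted only if every potential component is valid.
# Rejecting empty compositions is an assumption of this sketch.
def valid_composition?(components)
  !components.empty? && components.all?(&:valid)
end

# Among the compositions produced by the selections, keep the valid one
# with the most components. Returns nil when no selection is valid.
def best_composition(compositions)
  compositions.select { |c| valid_composition?(c) }.max_by(&:size)
end

selections = [
  [Component.new("C", true), Component.new("Cr", true)],
  [Component.new("C", true), Component.new("Cr", true), Component.new("Fe", true)],
  [Component.new("C", true), Component.new("??", false)]
]

best = best_composition(selections)
puts best.map(&:symbol).join(", ")
```

Here the third selection is discarded because one of its components is invalid, and the second one wins over the first because it contains more valid components.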

The code corresponding to this process can be found in the processSelection function of datasheet_process_controller.rb.

Component validation

For each selection, once a .csv has been extracted by Tabula, we do the following :

1. Parse name

Try to parse the component name. To do so, we find all words in the potential name string using the /\w+/ regular expression.

For each matching word, we check whether it is the full name OR the symbol of an element in the periodic table. For now, only English names are supported.

This is also a first standardization of the parsed names. Each component is represented as a Ruby object with a full name field and a symbol field.

Examples :

"Manganese....................................." => Mn
"Silicon (Si)" => Si
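A minimal sketch of this name lookup, assuming a small hash as a stand-in for the periodic table. ELEMENTS, SYMBOLS and parse_symbol are hypothetical names, and the case handling (names case-insensitive, symbols case-sensitive) is an assumption; the actual logic lives in the project sources.

```ruby
# Tiny excerpt of the periodic table, for illustration only.
ELEMENTS = {
  "carbon" => "C", "silicon" => "Si", "manganese" => "Mn",
  "chromium" => "Cr", "iron" => "Fe"
}.freeze
SYMBOLS = ELEMENTS.values.freeze

# Returns the symbol for the first word that is a known full name
# (compared case-insensitively) or a known symbol (compared as is).
# Returns nil when no word matches.
def parse_symbol(raw_name)
  raw_name.scan(/\w+/).each do |word|
    return ELEMENTS[word.downcase] if ELEMENTS.key?(word.downcase)
    return word if SYMBOLS.include?(word)
  end
  nil
end

puts parse_symbol("Manganese.....................................")
puts parse_symbol("Silicon (Si)")
p parse_symbol("Tensile strength")
```

The first two calls print Mn and Si, matching the examples above, and the last one prints nil because none of its words is a known name or symbol.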

2. Parse value

If the name is valid, check that a value field is present, because sometimes only the name column / line is extracted, without the value fields. If a value field is present, the component is considered valid.
The component is considered as balance if /(Bal)|(bal)|(Base)|(base)|>|≥/ matches.
The component is considered as residual if /≤|</ matches.
The value(s) are extracted by scanning the string with /\d+.\d+/.
If two values match, the component is considered a range: value becomes the mean of the two values, and the matching values are saved as minValue and maxValue.

Examples :

Simple

Original .csv fields : Carbon (C),0.03
Parsed component :

Name : C
Value : 0.03
Range : false
MinValue : 
MaxValue : 
Balance : false
Residual : false

With range

Original .csv fields : Chromium (Cr),17.50 - 19.50
Parsed component :

Name : Cr
Value : 18.5
Range : true
MinValue : 17.5
MaxValue : 19.5
Balance : false
Residual : false

Balance

Original .csv fields : Iron (Fe),Balance
Parsed component :

Name : Fe
Value : 
Range : 
MinValue : 
MaxValue : 
Balance : true
Residual : false

With unknown characters

Original .csv fields : Sulphur (S),0.015b)
Parsed component :

Name : S
Value : 0.015
Range : false
MinValue : 
MaxValue : 
Balance : false
Residual : false

Residual with value

Original .csv fields : C, ≤ 0.030
Parsed component :

Name : C
Value : 0.03
Range : false
MinValue : 
MaxValue : 
Balance : false
Residual : true
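
The value rules and the examples above can be reproduced with a short sketch. It uses the three regular expressions exactly as documented; the ParsedValue struct and the parse_value function are hypothetical names, not those of composition.rb, and Range is always a boolean here, whereas the Balance example above leaves it blank.

```ruby
# Patterns copied from the rules above.
BALANCE_RE  = /(Bal)|(bal)|(Base)|(base)|>|≥/
RESIDUAL_RE = /≤|</
# As written, the dot is unescaped, so it matches any single character,
# and plain integers such as "2" are not captured.
VALUE_RE    = /\d+.\d+/

ParsedValue = Struct.new(:value, :range, :min_value, :max_value,
                         :balance, :residual)

# Parses the value field of one component.
def parse_value(field)
  numbers = field.scan(VALUE_RE).map(&:to_f)
  range = numbers.size == 2
  ParsedValue.new(
    range ? (numbers[0] + numbers[1]) / 2.0 : numbers.first, # mean of a range
    range,
    range ? numbers[0] : nil,
    range ? numbers[1] : nil,
    field.match?(BALANCE_RE),
    field.match?(RESIDUAL_RE)
  )
end

["0.03", "17.50 - 19.50", "Balance", "0.015b)", "≤ 0.030"].each do |field|
  p parse_value(field).to_h
end
```

For "17.50 - 19.50" this yields a value of 18.5 with a minimum of 17.5 and a maximum of 19.5, and for "≤ 0.030" it yields a value of 0.03 flagged as residual.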

More details about composition parsing can be found in composition.rb.

Known limits

Balance and residual

Other characters and words that mark a component as balance or residual probably exist and are not handled yet. However, this can be fixed simply by adding them to the corresponding regular expressions.
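
As an illustration of such an extension, the sketch below adds a few tokens to the documented patterns. The extra tokens ("Rem" / "rem" for balance, "max" / "Max" for residual) are assumptions chosen for the example, not markers the tool is known to need, and the constant names are hypothetical.

```ruby
# Documented alternatives plus hypothetical extra tokens.
BALANCE_EXTENDED  = /Bal|bal|Base|base|Rem|rem|>|≥/
RESIDUAL_EXTENDED = /≤|<|[Mm]ax/

puts BALANCE_EXTENDED.match?("Remainder")  # true
puts RESIDUAL_EXTENDED.match?("0.03 max")  # true
puts BALANCE_EXTENDED.match?("0.03")       # false
```

Short tokens such as "rem" also match inside longer words, so any new alternative is worth checking against real datasheet fields before being added.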

Valid but incomplete

When selections are applied to other datasheets, the result can be detected as valid even though some components are actually missing.

Example :

[Image: IncompleteSelectionExample]

For now, the only solution to this problem is to manually check the selections / extracted data for those datasheets, even if they are marked as parsed.

Complex tables

Currently, more complex tables are not handled.

Example :

[Image: ComplexTable]