Using the IRSAAL tool - irsaal/urdu GitHub Wiki

The tool is divided into six component parts and is accompanied by an introduction and formula list.

The six parts are: a clean corpus sheet, a corpus analyser, a word bank, a regular-expression-based transformer for normalizing Roman Urdu words, a Roman-to-Perso-Arabic script converter, and a lookup table.

Part 1: Collection

Input your cleaned corpus, produced either with the data collection workflow outlined in One Way to Clean or in another manner of your choosing. (Our sample dataset has 429 unique tweets.) Compile these into the sheet in the tool by pasting from the previous workspace.

Part 2: Preliminary corpus analysis

This second component involves primarily “getting to know your corpus” so that you can edit other elements of the tool in an effort to increase its accuracy and contribute to the wider word banking project.

We used Voyant as a simple way to split tweets into individual strings and to count the words occurring in the entire corpus of tweets. We could have used Sheets to accomplish this, but Voyant was simple and also let us visualise other dimensions of the corpus that would not have been possible with basic Google Sheets functions.

At this stage, our corpus contained 994 strings. (Voyant may limit the number of words it outputs, so this count should be compared against another technique.)

In addition to this word frequency tabulation, once the data was back in Sheets we calculated word lengths so that we could sort by both length and frequency. This supported our proposed informal sampling method, which seeks to ensure accuracy across the various types of words in a given corpus.
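As a rough sketch of this step (the tool itself does this with Sheets formulas and Voyant output), the word tabulation and the two-key sort might look like the following in Python; the tweet strings here are hypothetical stand-ins, not the project corpus:

```python
from collections import Counter

# Hypothetical stand-ins for the cleaned tweet corpus.
tweets = ["mein ghar ja raha hun", "tum kahan ho", "mein theek hun"]

# Tabulate word frequencies across the whole corpus.
freq = Counter(word for tweet in tweets for word in tweet.split())

# Pair each word with its length so we can sort by either dimension.
rows = [(word, len(word), count) for word, count in freq.items()]

# Sort by length (descending), then by frequency (descending).
rows.sort(key=lambda r: (-r[1], -r[2]))
for word, length, count in rows:
    print(word, length, count)
```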

We also tagged words for inclusion in the word bank, so that we could sort by tag and add them to the word bank. Our basis for tagging these words was primarily letters which we knew the transformer would not be able to accommodate - essentially any roman character which could correspond to multiple Urdu letters, or which typically takes multiple letters to convey distinctively in roman script, e.g., usually the letters from Arabic: zal, zuad, zay, zoy, or ghay and khay. We accomplished this tagging with a binary code (i.e., 0 or 1), which you can then sort in order to copy and paste into the Word Bank sheet.

Finally, the tool provides a match formula to give users an easy view of whether their selected words were already included in the word bank (True) or whether they should go ahead and add the new words to the word bank.
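A minimal sketch of that membership check (the tool uses a MATCH formula in Sheets; both word lists below are hypothetical):

```python
# Words already in the word bank, and words tagged 1 in the analyser.
# Both lists are illustrative assumptions, not the project data.
word_bank = {"hun", "nahi", "kyun"}
tagged_words = ["hun", "mein", "kyun", "acha"]

# True means the word is already included; False means it still
# needs to be added to the word bank.
already_included = {w: (w in word_bank) for w in tagged_words}
to_add = [w for w, present in already_included.items() if not present]
print(to_add)
```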

Part 3: Word Bank

The aim of the word bank is to accommodate words that would not successfully be transformed using the distinctive rules in the transformer. We established criteria for these additions initially by manually looking for words in the corpus analyser, based on known difficulties in working with certain letters and on observations from working in social media spaces, particularly shortened abbreviations. As we worked and evaluated the success of transformations, we also returned to add additional high-frequency, poorly transformed words to the word bank.

You are able to pull all word bank words from the corpus analyser by sorting for the tagged word bank words and copying them into the word bank sheet.

Next, you should manually populate all variant spellings of words by further working through the corpus analyser, for example scanning for possible variants of ‘hun’ (e.g., ‘hu’ and ‘hoon’). This methodology might require greater precision for larger-scale corpora, but it remains a somewhat intuitive enterprise.

You are then able to produce intermediaries for variants referencing the Urdu lookup table.
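One way to picture the resulting word bank structure is a headword grouping its variant spellings and mapping them all to a single intermediary; the entry and the macron-bearing intermediary form below are illustrative assumptions, not rows from the actual sheet:

```python
# Illustrative word bank entry: variants of one word resolve to one
# intermediary romanization, which the script converter then consumes.
word_bank = {
    "hun": {
        "variants": ["hun", "hu", "hoon"],
        "intermediary": "hūn",  # assumed standardized form
    },
}

# Flatten the bank so any variant can be resolved in one lookup.
variant_to_intermediary = {
    v: entry["intermediary"]
    for entry in word_bank.values()
    for v in entry["variants"]
}
print(variant_to_intermediary["hoon"])  # → "hūn"
```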

We provide a formula to verify the accuracy of these intermediaries by importing the script converter into this sheet and comparing the converted output with a column of manually input Urdu words in the Perso-Arabic script.
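In sketch form, the verification step runs each intermediary through the converter and flags mismatches against the hand-typed column; `convert()` below stands in for the imported script-converter logic (Part 5), and the cipher rows shown are an illustrative subset:

```python
# Illustrative subset of the lookup-table cipher (Part 6).
CIPHER = {"b": "ب", "ā": "ا", "t": "ت"}

def convert(intermediary: str) -> str:
    # Stand-in for the imported script converter: character-by-character
    # substitution through the cipher.
    return "".join(CIPHER.get(ch, ch) for ch in intermediary)

intermediaries = ["bāt"]
manually_input = ["بات"]  # manually typed Perso-Arabic column

# True rows indicate the intermediary converts to the expected word.
checks = [convert(i) == g for i, g in zip(intermediaries, manually_input)]
print(checks)
```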

Word bank words

In total, the initial project example produced nearly 1000 unique word entries, and just over 400 total variants in the word bank - as compared to the 1000 ‘most common words’ that were manually coded in the Sharf and ur-Rahman project. These accounted for around 21% of the total number of accurate results in the tool.

Part 4: Regular Expression Transformer

The regular expression (regex) transformer is the central engine for normalization transformations. It accounts for almost 70% of the total accurate results in the tool, and on its own produced accurate results for 44% of the corpus. It comprises a series of transformations using regex formulas towards an intermediary, standard romanization, which can then be used for analysis of a standardized Roman Urdu corpus, or which can be converted into the Perso-Arabic script for comparison with the much larger body of data in that script.

The sheet imports all words from the corpus. For the purposes of testing and the example data subset, however, we used an informal sampling technique to select high frequency words, words with high numbers of variants, and long words to ensure that this component could handle various inputs.

The transformer encompasses 14 unique regex formulas and two duplicate formulas that tackle doubled vowels and occurrences of medial e’s. The final formula in the sheet checks whether the word from the corpus appears in the word bank; if it does, it yields the word bank transformation instead of the transformer’s output. Otherwise, the transformed romanization is produced.
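A hedged sketch of that pipeline in Python: the two rules below (collapsing doubled vowels to macron forms, rewriting a medial ‘e’) are assumed for illustration rather than taken from the tool’s formula list, and the word bank entry is hypothetical. The final step mirrors the sheet’s precedence: a word bank hit overrides the regex output.

```python
import re

# Illustrative rules, not the tool's actual 14 formulas.
RULES = [
    # Doubled vowels collapse to a macron form.
    (re.compile(r"(aa|ee|oo)"),
     lambda m: {"aa": "ā", "ee": "ī", "oo": "ū"}[m.group(1)]),
    # A medial 'e' is rewritten to 'i' (assumed rule).
    (re.compile(r"(?<=\w)e(?=\w)"), lambda m: "i"),
]

word_bank = {"hoon": "hūn"}  # hypothetical word bank entry

def transform(word: str) -> str:
    # Word bank transformations take precedence over the regex path.
    if word in word_bank:
        return word_bank[word]
    for pattern, repl in RULES:
        word = pattern.sub(repl, word)
    return word

print(transform("hoon"))   # word-bank hit → "hūn"
print(transform("theek"))  # regex path → "thīk"
```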

Part 5: Script Converter

This component draws from the final list of transformed words in order to produce an analogue word in the Perso-Arabic script. The converter takes in the list of words in the final (right-most) column of the regex transformer sheet and parses each word into single characters.

Then, using the cipher in the lookup table sheet, the converter applies a 1:1 reference between the split letters of the intermediary Roman Urdu transliteration and the corresponding Perso-Arabic letters in the lookup table.

The sheet then recompiles the word, concatenating the individual Perso-Arabic letters.
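The parse–lookup–recompile sequence can be sketched as follows; the cipher shown is an illustrative subset of the lookup table, not the tool’s full cipher:

```python
# Illustrative subset of the lookup-table cipher.
CIPHER = {"p": "پ", "ā": "ا", "n": "ن", "ī": "ی"}

def to_perso_arabic(intermediary: str) -> str:
    # Parse the intermediary into single characters, map each through
    # the cipher (1:1 lookup), then recompile by concatenation.
    letters = [CIPHER.get(ch, ch) for ch in intermediary]
    return "".join(letters)

print(to_perso_arabic("pānī"))  # → "پانی"
```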

In the example spreadsheet, we manually coded the sample of words to evaluate the effectiveness of the tool. We omitted English words from the count by tagging them “4”, identified dictionary-accurate conversions by tagging them “1”, tagged words that were correct by virtue of the word bank “2”, and tagged words “3” when they differed only by one of three common variants (nun vs. nun ghunna; choti he vs. do chashmi he; and choti ye vs. bari ye) which, due to historical typographic and orthographic variation, we determined were largely intelligible to most readers.
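The resulting tally can be sketched as below; the tag column is hypothetical, and a tag of 0 for inaccurate conversions is an assumption made here for illustration:

```python
# Hypothetical tag column: 1 = dictionary-accurate, 2 = correct via the
# word bank, 3 = intelligible variant, 4 = English (excluded),
# 0 = inaccurate (assumed coding for this sketch).
tags = [1, 1, 2, 3, 4, 0, 2, 0, 3, 1]

scored = [t for t in tags if t != 4]             # drop English words
accurate = [t for t in scored if t in (1, 2, 3)]  # tags counted as accurate
accuracy = len(accurate) / len(scored)
print(f"{accuracy:.1%}")  # → 77.8%
```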

Some errors which may be intelligible to many readers, notably long vowels appearing where they should not, were excluded due to their very high frequency and the need for a more systematic solution.

Overall the converter produced results that were 64.6% accurate for the sample word selection from this corpus.

Part 6: Script index/cipher/lookup table

The final component is a static lookup table which serves as the cipher for the converter functions in the previous component (as well as the test feature in the word bank).

The lookup table is based initially on the US Library of Congress (LoC) romanization scheme, with changes to ensure a 1:1 relationship between letters. Several Unicode characters with macrons and double dots below did not render in Google Sheets, and we consulted the Wiki page for Persian Romanization and The Brill Typeface User Guide & Complete List of Characters to complete the table.
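That 1:1 requirement can be checked mechanically: every roman symbol must map to a distinct Perso-Arabic letter, or the cipher is not reversible. A small sketch, using an illustrative subset of rows rather than the tool’s full table:

```python
# Illustrative subset of the lookup table.
LOOKUP = {"b": "ب", "p": "پ", "t": "ت", "s": "س", "m": "م"}

# If two roman symbols shared a Perso-Arabic letter, the value set
# would shrink and this check would fail.
assert len(set(LOOKUP.values())) == len(LOOKUP), "mapping is not 1:1"

# Because the mapping is 1:1, it can be inverted for round-tripping.
reverse = {v: k for k, v in LOOKUP.items()}
print(reverse["پ"])  # → "p"
```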