Parsing the OCR Text

I decided that the best approach here would be to use regular expressions to parse the .txt files. I began by merging all files of the form bistum_*_single_file.txt into a single .txt file. After copying all of the relevant files into a new folder, I merged them from the command line using the "cat" command.

This command created a merged .txt file called all_biographies_single_file.txt. I performed all of my regular expression editing in Atom. As I experimented with regular expressions, I noticed two blatant OCR errors, which I fixed directly in the body of the text file so that I would not have to rely on regular expressions to fix them.
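
For readers who prefer a script to the shell, the merge can be approximated in Python. This is only a sketch of an equivalent step, not the command that was actually run: the function name merge_text_files is hypothetical, and the UTF-8 encoding and alphabetical merge order are assumptions.

```python
from pathlib import Path


def merge_text_files(folder, pattern="bistum_*_single_file.txt",
                     output_name="all_biographies_single_file.txt"):
    """Concatenate every file in `folder` matching `pattern` into one file."""
    folder = Path(folder)
    output_path = folder / output_name
    # Sort the matches so the merge order is deterministic (assumed order).
    sources = sorted(p for p in folder.glob(pattern) if p != output_path)
    with open(output_path, "w", encoding="utf-8") as merged:
        for source in sources:
            # UTF-8 is an assumption about how the OCR output was saved.
            merged.write(source.read_text(encoding="utf-8"))
    return output_path
```

Calling merge_text_files("path/to/folder") would write the merged file next to the source files and return its path.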

We begin with names of the form:

KORNEK, ERNST (P. VINZENZ)

OFM

where the second line is an additional optional information field that I have chosen to delete in order to recover the rest of the biographies (these constitute < 5% of the biographies present in the text file). We remove the lower line (in the above example, "OFM") and then add quotation marks around the name, using the following two find & replace regular expressions, where the top line is the Find regular expression and the bottom line is the Replace regular expression:

\n([A-Z]{2,}, [A-Z]{2,} [A-Z]{2,} \(.*\))\s*[A-Z]*

\n"$1",

and

\n([A-Z]{2,}, [A-Z]{2,} \(.*\))\s*[A-Z]*

\n"$1",
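
To see what these two replacements do, here is a small Python sketch that applies them with the re module. It assumes Python's syntax, in which the $1 of the Replace field is written \1; the function name is hypothetical and the date line in the sample is an invented placeholder.

```python
import re

# The two Find patterns from the text, unchanged.
PAREN_NAME_THREE = re.compile(r'\n([A-Z]{2,}, [A-Z]{2,} [A-Z]{2,} \(.*\))\s*[A-Z]*')
PAREN_NAME_TWO = re.compile(r'\n([A-Z]{2,}, [A-Z]{2,} \(.*\))\s*[A-Z]*')


def quote_parenthetical_names(text):
    """Quote names carrying a parenthetical and drop the line that follows."""
    for pattern in (PAREN_NAME_THREE, PAREN_NAME_TWO):
        # r'\n"\1",' is the Python spelling of the Replace field \n"$1",
        text = pattern.sub(r'\n"\1",', text)
    return text


# The name is the example from the text; the date is a placeholder.
sample = "\nKORNEK, ERNST (P. VINZENZ)\n\nOFM\n\n1900 01 01\n"
print(repr(quote_parenthetical_names(sample)))
# prints '\n"KORNEK, ERNST (P. VINZENZ)",\n\n1900 01 01\n'
```

The trailing \s*[A-Z]* is what consumes the blank line and the "OFM" line, so only the quoted name and its comma remain.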

Next, we can target all names of the form: SAFT, P. PAUL FRANZ SJ using the regular expressions:

\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})

\n"$1, $2$3$4$5",

Next, we target all names of the form: HERRMANN, P. CAMILLUS MSC using the regular expressions:

\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})( [A-Z]{2,})

\n"$1, $2$3$4",

Next, we target all names of the form: BAUER, P. GEORG using the regular expressions:

\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})

\n"$1, $2$3",

Note that it is important to apply these patterns in order of decreasing word count. Otherwise, a shorter pattern would match only the first part of a longer name and quote it prematurely.
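
The effect of the ordering can be checked with a short Python sketch. It reuses the two patterns given above for HERRMANN, P. CAMILLUS MSC and BAUER, P. GEORG, written with Python's \1 in place of $1; the helper apply_in_order is a hypothetical name.

```python
import re

# (Find, Replace) pairs from the text, in Python replacement syntax.
P_PLUS_TWO_WORDS = (r'\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})( [A-Z]{2,})',
                    r'\n"\1, \2\3\4",')
P_PLUS_ONE_WORD = (r'\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})',
                   r'\n"\1, \2\3",')


def apply_in_order(text, rules):
    """Apply each (find, replace) pair to the text, one after the other."""
    for find, replace in rules:
        text = re.sub(find, replace, text)
    return text


line = "\nHERRMANN, P. CAMILLUS MSC"
longest_first = apply_in_order(line, [P_PLUS_TWO_WORDS, P_PLUS_ONE_WORD])
shortest_first = apply_in_order(line, [P_PLUS_ONE_WORD, P_PLUS_TWO_WORDS])

print(repr(longest_first))   # prints '\n"HERRMANN, P. CAMILLUS MSC",'
print(repr(shortest_first))  # prints '\n"HERRMANN, P. CAMILLUS", MSC'
```

Once a name has been quoted, the line begins with a quotation mark, so no later pattern can match it again. That is why the longest pattern has to run first.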

Next, we target all names of the form: THIERY, KARL THEODOR HUBERT GUSTAV:

\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})

\n"$1, $2$3$4$5",

Next, target all names of the form: THOENE, FRANZ XAVER ANTON:

\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})

\n"$1, $2$3$4",

Next, target all names of the form: THELEN, PETER HUBERT:

\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})

\n"$1, $2$3",

Next, we target all names of the form: THOME, JOHANNES:

\n([A-Z]{2,}), ([A-Z]{2,})

\n"$1, $2",
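
Taken together, the seven name patterns above form an ordered list of rules. The following Python sketch collects them in the order given and runs them over a few lines; it assumes Python's \1 syntax for $1, the names NAME_RULES and quote_names are hypothetical, and the date lines are invented placeholders.

```python
import re

# Find/Replace pairs from the text, longest pattern first within each group.
NAME_RULES = [
    (r'\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})',
     r'\n"\1, \2\3\4\5",'),
    (r'\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})( [A-Z]{2,})',
     r'\n"\1, \2\3\4",'),
    (r'\n([A-Z]{2,}), ([A-Z]{1,}\.+)( [A-Z]{2,})',
     r'\n"\1, \2\3",'),
    (r'\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})',
     r'\n"\1, \2\3\4\5",'),
    (r'\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})( [A-Z]{2,})',
     r'\n"\1, \2\3\4",'),
    (r'\n([A-Z]{2,}), ([A-Z]{2,})( [A-Z]{2,})',
     r'\n"\1, \2\3",'),
    (r'\n([A-Z]{2,}), ([A-Z]{2,})',
     r'\n"\1, \2",'),
]


def quote_names(text):
    """Run every name rule over the text in the listed order."""
    for find, replace in NAME_RULES:
        text = re.sub(find, replace, text)
    return text


# Names are the examples from the text; the dates are placeholders.
sample = ("\nSAFT, P. PAUL FRANZ SJ\n1900 01 01"
          "\nBAUER, P. GEORG\n1900 01 01"
          "\nTHIERY, KARL THEODOR HUBERT GUSTAV\n1900 01 01"
          "\nTHOME, JOHANNES\n1900 01 01\n")
print(quote_names(sample))
```

Each name comes out wrapped in quotation marks and followed by a comma, while the placeholder date lines are left untouched for the next step.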

At this point, it is necessary to edit the two entries for Bernhard Wessel and Joseph Brandenburg by hand to include their additional middle names in appropriate locations – they had initially been placed on new lines by OCR.

Lastly, we eliminate the remaining edge cases in which a name is still followed by an extra field:

",\s*\n*[A-Z]{2,}

", (note that there is a space at the end of this expression)

We are now ready to target birthdates, putting them in the form: "YYYY", "MM", "DD":

",\s*\n*([0-9]{4,})\s([0-9]{2,})\s([0-9]{2,})

", "$1", "$2", "$3",

Next, we must target two edge cases where the dates contained only a year, in which case we assign "MM" = "DD" = "00":

",\s*\n*([0-9]{2,})

", "$1", "00", "00",

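The two date replacements can be tried out in Python as follows. This is a sketch under assumptions: $1, $2 and $3 are written \1, \2 and \3, the function name is hypothetical, and the dates and biography text in the samples are invented placeholders.

```python
import re

# Full date first, then the year-only fallback, as in the text.
FULL_DATE = (r'",\s*\n*([0-9]{4,})\s([0-9]{2,})\s([0-9]{2,})',
             r'", "\1", "\2", "\3",')
YEAR_ONLY = (r'",\s*\n*([0-9]{2,})',
             r'", "\1", "00", "00",')


def quote_birthdates(text):
    """Turn the date after a quoted name into three quoted fields."""
    for find, replace in (FULL_DATE, YEAR_ONLY):
        text = re.sub(find, replace, text)
    return text


full = '"THOME, JOHANNES",\n\n1900 01 01\n\nBiography text.'
year = '"BAUER, P. GEORG",\n\n1900\n\nBiography text.'

print(repr(quote_birthdates(full)))
# prints '"THOME, JOHANNES", "1900", "01", "01",\n\nBiography text.'
print(repr(quote_birthdates(year)))
# prints '"BAUER, P. GEORG", "1900", "00", "00",\n\nBiography text.'
```

After the first replacement, every comma is followed by a space and a quotation mark rather than a digit, so the year-only pattern leaves already converted dates alone.
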
Next, we target the beginning of the rest of the biography and add a quote, in order to capture the remaining portion of the biography in quotes:

("[0-9]{2}", "[0-9]{2}",)\s*

$1 "

We now target the end of the rest of the biography:

\s*\n*("[A-Z]{2,})

"\n$1

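The two quoting steps can be sketched in Python like this, again with \1 standing in for $1. The function name is hypothetical, and the sample (including its leftover first line, dates and biography text) is invented for illustration.

```python
import re


def quote_biography_bodies(text):
    """Open a quote after the day field and close it before the next name."""
    # Start of the body: keep the month and day fields, then open a quote.
    text = re.sub(r'("[0-9]{2}", "[0-9]{2}",)\s*', r'\1 "', text)
    # End of the body: close the quote just before the next quoted name.
    text = re.sub(r'\s*\n*("[A-Z]{2,})', r'"\n\1', text)
    return text


sample = ('tail of a previous biography.\n\n'
          '"THOME, JOHANNES", "1900", "01", "01",\n\nBiography text one.\n\n'
          '"THELEN, PETER HUBERT", "1900", "01", "01",\n\nBiography text two.\n')
print(quote_biography_bodies(sample))
```

In the output, the leftover text before the first name picks up a stray closing quote, and the last biography has no closing quote yet. Both are expected at this stage of the sketch.
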
At this point, we delete the top of the file (as it contains the latter portion of a biography that was split across two pages) and add a single quotation mark at the very end of the file. In order to make this into a CSV file, all that remains is to:

1) remove all line breaks:

 (.)\n(.)

$1 $2

2) place line breaks only at the beginnings of new biography entries:

("[A-Z]{2,})

\n$1
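
These last two replacements can be sketched in Python as shown here. The sketch assumes the Find pattern of step 1 is (.)\n(.) without a leading space, writes $1 and $2 as \1 and \2, and uses a hypothetical function name and invented sample text.

```python
import re


def one_entry_per_line(text):
    """Join wrapped lines, then start a new line before each quoted name."""
    # Step 1: replace a line break between two characters with a space.
    text = re.sub(r'(.)\n(.)', r'\1 \2', text)
    # Step 2: put a line break in front of every quoted, capitalized name.
    text = re.sub(r'("[A-Z]{2,})', r'\n\1', text)
    return text


sample = ('"THOME, JOHANNES", "1900", "01", "01", "Biography text\n'
          'continued on a second line."\n'
          '"THELEN, PETER HUBERT", "1900", "01", "01", "Another biography."')
rows = [line.strip() for line in one_entry_per_line(sample).splitlines()
        if line.strip()]
for row in rows:
    print(row)
```

In Python, "." does not match a line break, so a blank line (two line breaks in a row) is left in place by step 1; the sample therefore contains only single line breaks.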

Exporting this as a CSV file and opening the file in Excel, we are left with a table with 563 biographical entries:

The initial CSV file produced using regular expressions. The names, as well as year, month, and day of birthdates are isolated as separate fields, with the rest of each of the biographies being stored as a long string.

Out of the 563 entries in the CSV file, 558 are in the desired format. The 5 entries not in the correct format are all edge cases, such as biographies with missing birthdates. Given that these edge cases constitute such a small percentage of the CSV file, the file is ready for the next step. Here, the next step is to target the 66 biographies of interest by creating relationships between this CSV file and the other CSV file.
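
A count like this can be reproduced with Python's csv module. This is a sketch and not the check that was actually performed: it assumes a well-formed entry has exactly five fields (name, year, month, day, rest of biography), the function name is hypothetical, and the sample rows are invented.

```python
import csv
import io

EXPECTED_FIELDS = 5  # name, year, month, day, rest of biography (assumed)


def split_rows_by_shape(csv_text, expected_fields=EXPECTED_FIELDS):
    """Separate rows with the expected number of fields from the rest."""
    # skipinitialspace lets the reader accept the space after each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    well_formed, malformed = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) == expected_fields:
            well_formed.append(row)
        else:
            malformed.append(row)
    return well_formed, malformed


sample = ('"THOME, JOHANNES", "1900", "01", "01", "Biography text."\n'
          '"BAUER, P. GEORG", "Biography without a birthdate."\n')
good, bad = split_rows_by_shape(sample)
print(len(good), len(bad))  # prints 1 1
```

Listing the malformed rows rather than only counting them makes it easier to decide whether each edge case is worth fixing by hand.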

We begin by importing this dataset into FileMaker Pro. It is worth noting that editing the file in Excel and then importing the CSV file into FileMaker Pro corrupts the file, because Excel corrupts the text encoding of non-standard characters when moving between the Windows ANSI and UTF-8 encodings. It is therefore advantageous to work directly in FileMaker Pro. Creating and populating a new field with a unique ID number is described at the link below:

http://help.filemaker.com/app/answers/detail/a_id/9903/~/how-to-populate-a-field-with-a-unique-id

After populating a new field with a unique ID, we are left with the following database in FileMaker Pro:

We then export the "ID," "Name," "Birth Day," "Birth Month," "Birth Year," and "Rest of Biography" fields to a new CSV file. As mentioned above, a handful of entries contain more fields than these desired ones, but these entries constitute < 1% of the database; for this reason, we proceed as if each entry is in the correct form. This CSV file is now ready to be compared to the original Excel file. I describe this process on the Linking the Datasets page.