Scanning the Biographies

Extracting clergy biographies from the two volumes of Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung requires the most steps of any part of this project, for the simple reason that the source consists of physical books that must first be imported into a digital data structure.  The main steps are as follows:

  • identifying and scanning the pages of the volumes containing the relevant biographies
  • using ABBYY FineReader to perform OCR on these scanned pages
  • using regular expressions to extract biographical information from the OCR text files into a data structure that is well-suited for being merged with the Excel file (my primary dataset)

Because it is instructive to understand how these steps are performed, I will elaborate on what each of them entails.

As mentioned above, the first task was to identify the pages of the books that contained biographies of priests listed in my primary dataset.  Ideally, I would have been able to scan every page of both volumes of Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung, OCR all of the text, and then use an automated method (e.g., regular expressions or a Python script) to locate the biographies of all of the German priests present in my dataset.  However, the two volumes together consist of 1968 pages (the overwhelming majority of which contain biographical information), making it impractical to scan each page by hand using a standard book scanner.  Furthermore, because these books were obtained through the Harvard Library System, it was not possible to remove the book bindings and feed the pages through an automated scanner.  I therefore needed a more selective approach to choosing which pages to scan.

The biographies of German priests within the two volumes of Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung are organized by Bistum, which translates to diocese.  Below is the first page from the "Bistum Mainz" section, containing biographies of German priests from the Mainz Diocese:

bistum_mainz_first_page.png

The first page of the Bistum Mainz section of biographies from Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung.

Fortunately, the Excel spreadsheet (my primary dataset) contains diocese as a field, so I could sort the spreadsheet by diocese.  The spreadsheet contains 445 priests of German nationality, meaning that 445 was the theoretical upper bound on the number of biographies that could be extracted from Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung.  However, there were a number of limitations:

  • 297 of the German priests listed in the Excel spreadsheet did not have an associated Catholic diocese.  In an ideal world, this problem would have been resolved by scanning the entire book and searching by name, but as mentioned above, this was not possible.
  • The two volumes did not contain every Bistum listed in the Excel spreadsheet.
  • In a limited number of cases, even though a priest’s Bistum was recorded in the Excel file and present in the two volumes, a biography of the specific priest was not available.

I adopted the strategy of sorting the Excel spreadsheet by Bistum, finding the Bistum in the two volumes, and then searching for all of the German priests listed for this Bistum in the Excel spreadsheet.  This method was much more efficient than blindly searching for each German priest’s name in the 101-page Personenregister (index of priests) in the back of the second volume.  A sample page of the Personenregister is pictured below:

index.png

A sample page from the Personenregister.  Note that because it is 101 pages long, it is time-consuming to search the index by eye.  With more time, it would have been possible to scan and OCR the Personenregister, search for the names of the priests who appear in the Excel spreadsheet, and then scan these pages. However, it was important that I had a digital dataset with which to work as soon as possible, and therefore, I adopted the above strategy, which resulted in scans of 64 relevant biographies after one night of scanning.
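The sort-by-Bistum strategy described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the rows stand in for the Excel spreadsheet, the field names ("name", "bistum") are hypothetical rather than the real column headers, and "Example Priest" is a placeholder for an entry with no recorded diocese.

```python
from collections import defaultdict

# Hypothetical rows standing in for the Excel spreadsheet.  The field names
# are assumptions, and "Example Priest" is a placeholder, not a real entry.
rows = [
    {"name": "Karl Barth", "bistum": "Mainz"},
    {"name": "Adam Ott", "bistum": "Mainz"},
    {"name": "Paul Adamus", "bistum": "Berlin"},
    {"name": "Example Priest", "bistum": ""},
]


def group_by_bistum(rows):
    """Return ({bistum: sorted names}, [names with no recorded bistum])."""
    by_bistum = defaultdict(list)
    no_bistum = []
    for row in rows:
        bistum = (row.get("bistum") or "").strip()
        if bistum:
            by_bistum[bistum].append(row["name"])
        else:
            # These priests cannot be located through the diocese sections.
            no_bistum.append(row["name"])
    grouped = {b: sorted(names) for b, names in sorted(by_bistum.items())}
    return grouped, no_bistum


lookup_lists, unassigned = group_by_bistum(rows)
for bistum, names in lookup_lists.items():
    print(f"Bistum {bistum}: {', '.join(names)}")
print(f"No diocese recorded: {len(unassigned)}")
```

Each printed line is then a checklist of names to look for within a single Bistum section of the books, which avoids a separate pass through the Personenregister for every priest.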

For each biography that I located, I then scanned the page or pages on which the biography appeared.  Below is a raw, uncropped scan of a page from the Bistum Köln section that contains the biography of Julian Kilinski, a German priest in my primary dataset:

julian_kilinski.png

A raw, uncropped scan image of a page from the Bistum Köln section of Priester unter Hitlers Terror: eine Biographische und Statistische Erhebung.  This page contains the biography of Julian Kilinski, a German priest who also appears in my primary dataset.

In some instances, the biography was split across two pages, as shown below:

split_scan_1.png

Split biography of Theodor Brasse, page 1.

split_scan_2.png

Split biography of Theodor Brasse, page 2.

In other cases, the biographies of two relevant priests appeared on the same page.  I made sure to note the edge cases as I scanned the pages to make the data extraction from the OCR easier to manage.  
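One way to keep track of such edge cases is a small scan manifest that records which priests appear on which scan.  The sketch below is an assumption about how this bookkeeping could be done, not a record of the notes I actually kept: the `ScanEntry` class and the `scan_index` values are hypothetical, while the names and dioceses are taken from my scans.

```python
from dataclasses import dataclass, field


@dataclass
class ScanEntry:
    bistum: str
    scan_index: int  # hypothetical position of the scan within its Bistum's PDF
    priests: list = field(default_factory=list)


# Illustrative entries only; the scan_index values are made up for the example.
manifest = [
    ScanEntry("Aachen", 1, ["Hubert Berger"]),
    ScanEntry("Aachen", 2, ["Theodor Brasse"]),
    ScanEntry("Aachen", 3, ["Theodor Brasse"]),
    ScanEntry("Meissen", 1, ["Otto Pies", "Fritz Remy", "Johannes Rothe"]),
]


def edge_cases(manifest):
    """Return (priests split across several scans, scans shared by several priests)."""
    scans_per_priest = {}
    for entry in manifest:
        for name in entry.priests:
            scans_per_priest.setdefault(name, []).append((entry.bistum, entry.scan_index))
    split = sorted(name for name, scans in scans_per_priest.items() if len(scans) > 1)
    shared = [(e.bistum, e.scan_index) for e in manifest if len(e.priests) > 1]
    return split, shared


split, shared = edge_cases(manifest)
print("Split across pages:", split)
print("Shared pages:", shared)
```

Flagging these cases up front means the later extraction step knows when to join the text of two pages and when to divide one page among several priests.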

I performed my preliminary round of scanning on the night of October 5th, 2016, using a book scanner available for use in Widener Library.  Partway through, I discovered how to crop the scans such that there was no black border around each page, and consequently, approximately half of my scans were not cropped.  Fortunately, this did not present an issue for FineReader, as described in the page entitled "OCRing the Scans."  Because I used a travel drive to transfer the scans to my personal machine, I was not limited by file size and therefore opted to use the highest grayscale quality available.

The final results of my scanning are as follows:

  • Bistum Aachen:  5 pages, 4 biographies (Hubert Berger, Theodor Brasse – split on 2 pages, Nikolaus Jansen, Gerhard Radecke)
  • Bistum Augsburg:  3 pages, 2 biographies (Johannes Burkart – split on 2 pages, Bernhard Heinzmann)
  • Bistum Berlin:  2 pages, 2 biographies (Paul Adamus, Paul Bartsch)
  • Bistum Breslau:  3 pages, 3 biographies  (Oskar Baensch, Anton Korczok, Alois Starker)
  • Bistum Freiburg:  5 pages, 5 biographies (Hermann Hahn, Oswald Haug, Albert Riesterer, Albert Trueby, Anton Spies)
  • Bistum Fulda:  3 pages, 3 biographies (Josef Albinger – split on 2 pages, Heinrich Huth, Konrad Trageser)
  • Bistum Koeln:  6 pages, 5 biographies (Hans Carls, Julian Kilinski, Anton Schwarz, Hermann Werhahn, Alois Theissen)
  • Bistum Limburg:  5 pages, 3 biographies (Jakob Bentz, Karl Michel – split on 2 pages, Wilhelm Poiess – split on 2 pages)
  • Bistum Mainz:  4 pages, 4 biographies (Jozef Adams, Karl Barth, Hans Brantzen, Adam Ott)
  • Bistum Meissen:  5 pages, 6 biographies (Otto Pies, Fritz Remy, Johannes Rothe – all 3 same page, Hermann Scheipers, Alois Scholze, Bernhard Wensch)
  • Bistum Muenster:  7 pages, 7 biographies (Heinrich Hessig, Karl Leisner, Josef Lodde, Johann Neumaier, Josef Reukes, Gerhard Storm, August Wessing)
  • Bistum Osnabrueck:  4 pages, 4 biographies (Bernhard Mecklenburg, Jakob Schmitt, Leopold Wiemker, Bernhard Wueste)
  • Bistum Paderborn:  9 pages, 9 biographies (Hans Bahrenberg, Gerhard Baumjohann, Eduard Farwer, Otto Guennewich, Karl Hoffman, Josef Pieper, Franz Riepe, Heinrich Rupieper, Hermann Vell)
  • Bistum Passau:  2 pages, 1 biography (Ludwig Braun – split on 2 pages)
  • Bistum Rottenburg:  1 page, 1 biography (Adolf Staudacher)
  • Bistum Trier:  10 pages, 7 biographies (Klemens Pereira, Johannes Ries, Arnold Schiffer, Johann Schmitt, Johann Schulz, Jakob Ziegler, Josef Zilliken)

In total, this amounts to 74 pages and 66 biographies.  After scanning the appropriate biographies from a given Bistum, I exported the scans as a PDF file.  My file structure appeared as below:

Screen Shot 2016-10-14 at 11.00.01 PM.png

The file structure of my initial scans.
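As a quick check on the totals reported above, the per-Bistum page and biography counts can be summed directly.  The dictionary below simply transcribes the counts from the list of scanning results; the variable names are my own.

```python
# (pages, biographies) per Bistum, transcribed from the scanning results above.
tallies = {
    "Aachen": (5, 4),
    "Augsburg": (3, 2),
    "Berlin": (2, 2),
    "Breslau": (3, 3),
    "Freiburg": (5, 5),
    "Fulda": (3, 3),
    "Koeln": (6, 5),
    "Limburg": (5, 3),
    "Mainz": (4, 4),
    "Meissen": (5, 6),
    "Muenster": (7, 7),
    "Osnabrueck": (4, 4),
    "Paderborn": (9, 9),
    "Passau": (2, 1),
    "Rottenburg": (1, 1),
    "Trier": (10, 7),
}

total_pages = sum(pages for pages, _ in tallies.values())
total_biographies = sum(bios for _, bios in tallies.values())
print(f"{len(tallies)} Bistum sections, {total_pages} pages, {total_biographies} biographies")
# prints "16 Bistum sections, 74 pages, 66 biographies"
```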