Excel File

My primary dataset is a Microsoft Excel spreadsheet containing information on 2,679 clergy members sent to Dachau Concentration Camp during the Holocaust. This dataset was kindly provided to me by Dr. Robert Ehrenreich, the Director of University Programs at the United States Holocaust Memorial Museum. The spreadsheet contains a number of fields with information on each clergy member, including:

  • First name and last name
  • ITS reference number
  • Title
  • Prisoner number
  • Date of birth
  • Birthplace
  • Nationality
  • Religion
  • Occupation
  • Catholic diocese
  • Date of arrest
  • Locations and dates of imprisonment (for example, transfers to other camps)
  • Ultimate fate (murdered or survived)

Unlike my other datasets, this dataset was provided to me in a form that did not require scanning, OCRing, or web scraping. It did, however, require a brief foray into text encoding, because a number of non-ASCII characters, such as accented letters, were corrupted. Based on my research into this, my best guess is that at some point the file's UTF-8 bytes were read using a different encoding (most likely Windows-1252), producing corruptions such as:

é -> Ã©
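
The corruption shown above matches what happens when UTF-8 bytes are decoded as Windows-1252. The following is a minimal sketch of that mechanism and its reversal, assuming this really is the cause; the actual history of the file is not known, and the function names are hypothetical.

```python
def make_mojibake(text: str) -> str:
    """Simulate the corruption: encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Try to undo the corruption; return the input unchanged if it does not round-trip."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that was never corrupted (or was corrupted differently) is left alone.
        return text


corrupted = make_mojibake("é")
print(corrupted)                   # the two-character sequence from the example above
print(repair_mojibake(corrupted))  # the original accented letter
```

This only works for characters whose UTF-8 bytes all have a Windows-1252 equivalent. A few byte values are undefined in Windows-1252, so some letters cannot be round-tripped this way and would need separate handling.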

After a number of unsuccessful attempts to convert the file using suggestions from Stack Overflow, I tried to repair it manually in Python by scraping the table on the following webpage:

http://www.i18nqa.com/debug/utf8-debug.html

using the Chrome Scraper extension and then using a Python dictionary to automate the replacements. However, Python could not import the CSV file correctly (the "new-line character seen in unquoted field" problem). I discovered that, for Python's csv module to read it, the file needed to be exported from Excel as a "Windows Comma Separated Values" file:

http://stackoverflow.com/questions/17770727/new-line-character-seen-in-unquoted-field
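
For illustration, the dictionary-based replacement could look something like the sketch below. It is not the script I actually ran: the mapping holds only a handful of entries rather than the full scraped table, the function and file names are hypothetical, the sample values are invented rather than taken from the dataset, and it assumes Python 3, where opening the file with newline="" lets the csv module handle old-style line endings itself.

```python
import csv
import tempfile
from pathlib import Path

# Illustrative subset only; the scraped table contains many more pairs.
MOJIBAKE_FIXES = {
    "Ã©": "é",
    "Ã¨": "è",
    "Ã¶": "ö",
    "Ã¼": "ü",
}


def fix_text(value: str) -> str:
    """Replace each known corrupted sequence with the intended character."""
    # Longest keys first, so a longer sequence is never partially rewritten.
    for bad in sorted(MOJIBAKE_FIXES, key=len, reverse=True):
        value = value.replace(bad, MOJIBAKE_FIXES[bad])
    return value


def read_fixed_rows(path):
    """Read a CSV file and repair every cell."""
    # newline="" passes line endings through untranslated, as the csv docs recommend.
    with open(path, newline="", encoding="utf-8") as handle:
        return [[fix_text(cell) for cell in row] for row in csv.reader(handle)]


# Invented two-line sample with carriage-return line endings.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "clergy_sample.csv"
    sample.write_bytes("name,birthplace\rRenÃ©,MÃ¼nchen\r".encode("utf-8"))
    rows = read_fixed_rows(sample)

print(rows)
```

A lookup table like this only repairs the sequences it lists, which is why it needs the complete table from the webpage to be useful on real data.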
 
Exporting the file from Excel this way unexpectedly also reversed the character corruption, leaving me with a CSV file ready to be linked to the other datasets. For a description of how I linked the datasets, please refer to the page "Linking the Datasets" after reading the pages on my other datasets, which describe how I processed them.