Data Manipulation

While I had the text in a manipulable format, the XML files that contained the text of the book also contained a large amount of HTML and file-formatting markup. My next step was to clean the data so that the files contained only the actual text from the book. The two images below display the content of one XML file before and after I removed all the unnecessary content.

Screen Shot 2016-10-23 at 9.31.19 PM.png

Before HTML was removed from content0011.xml

Screen Shot 2016-10-23 at 9.30.23 PM.png

After HTML was removed from content0011.xml

All of this changing and removing within the XML files was done with regular expressions in Atom. This was the first point in working with my dataset at which I had to change the existing data. Up until then, Google Books and Harvard Library had digitized the data in exactly the form I needed. The removal of the HTML was the first reorganization I needed to complete within my research.

The reorganization was really more a removal of unnecessary content. The data was significantly clean for OCR-produced text, which was yet another fortunate aspect of this particular dataset. Only small issues were found throughout all of the XML files, and the majority of the work was done with a few regular expressions, which I have listed below.

<[a-z 0-9=\'\.\:\;\_\-\%]+>

<!-- Content from Google Book Search, generated at 14329+[0-9]+ -->

</span>

</div>

When I searched for these patterns throughout the texts and replaced the matches with nothing, all that was left was the actual text from David Barnes. To consolidate the paragraph breaks and make the data more readable, I replaced every </p> with an extra return line. Once these replacements were performed on all of the XML files, all that remained were the anomalies within the text.
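For readers who would rather script this step than use Atom's find-and-replace, the short Python sketch below applies the same patterns to a single file and writes out the cleaned text; the file names here are placeholders for illustration, not my actual workflow.

import re

# A sketch of the cleanup described above; the patterns mirror the regular
# expressions listed earlier, and the file names are placeholders.
patterns = [
    r"<[a-z 0-9=\'\.\:\;\_\-\%]+>",
    r"<!-- Content from Google Book Search, generated at 14329+[0-9]+ -->",
    r"</span>",
    r"</div>",
]

with open("content0011.xml", "r", encoding="utf-8") as f:
    text = f.read()

# Replace every match with nothing, leaving only the book's text.
for pattern in patterns:
    text = re.sub(pattern, "", text)

# Turn each closing paragraph tag into an extra return line for readability.
text = text.replace("</p>", "\n\n")

with open("content0011_clean.xml", "w", encoding="utf-8") as f:
    f.write(text)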

These anomalies were an assortment of characters ranging from quotation marks to dashes. Each character had its own character code, so each could be changed across the files in one simple replace-all step. Finally, when all of the files were completely cleaned, I combined them into a single file. This is the file I will continue to use throughout my research. The next steps will be to analyze the data and reorganize again if necessary.
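As a rough illustration of those last steps, the sketch below normalizes a few of the problem characters and stitches the cleaned files together into one; the specific replacement pairs and file names are assumptions for illustration, since the actual replace-all steps were done in Atom.

import glob

# Illustrative character fixes; the real set of anomalies was found by
# inspecting the files, so these pairs are assumptions.
replacements = {
    "\u201c": '"',   # curly opening quotation mark
    "\u201d": '"',   # curly closing quotation mark
    "\u2014": "-",   # long dash
}

combined = []
for path in sorted(glob.glob("content*.xml")):
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    for old, new in replacements.items():
        text = text.replace(old, new)
    combined.append(text)

# Write every cleaned file into one combined file for later analysis.
with open("combined_text.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(combined))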