Data Extraction

Screen Shot 2016-10-23 at 8.37.07 PM.png

Harvard Library access to David Barnes's book

Extracting the text of David Barnes's book was the first step in creating a viable dataset for my research. As I laid out in my project proposal, I wanted to extract the text from this book and run specific tests on the resulting text data. Fortunately, this first step was shortened: the book had already been digitized and placed on Google Books. Harvard's online library contains a link that takes searchers to a scanned copy of every page of "The Draft Riots in New York".

Google created a free ebook that can be downloaded from its website. With this step already done for me, I then began to make the text easier to manipulate for the analysis I would like to perform. While the scanned copies of each page are helpful, page images are not the format I need in order to extract specific text from the book. My initial thought for the next step was to run OCR (optical character recognition) on the digitized copy. Overall, this book was an easy dataset to work with because the first step in the data extraction had already been completed.
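To give a sense of what making OCR output easier to manipulate could involve, here is a minimal Python sketch. It is not the process I actually ran, only an illustration under a few assumptions: that the OCR step produces plain text with hard line wraps, that words split across lines end in a hyphen, and that blank lines mark paragraph breaks. The function name `clean_ocr_text` and the sample string are hypothetical, and the sample is made up rather than quoted from the book.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output into plain paragraphs (hypothetical helper)."""
    # Rejoin words hyphenated across a line break: "be-\ngan" -> "began".
    # Caveat: this also fuses true compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat a single newline as a soft wrap; keep blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs left over from the page layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Made-up sample standing in for OCR output; not a quotation from the book.
sample = "The riots be-\ngan on Monday\nmorning.\n\nThe   police responded."
cleaned = clean_ocr_text(sample)
print(cleaned)
```

Text cleaned this way could then be searched or split into paragraphs with ordinary string operations, though real OCR output would likely need further checks for misread characters.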

Screen Shot 2016-10-23 at 8.13.45 PM.png

Google Books had already digitized the book (fast-forwarding my process)