Working with Data

Overview

This section outlines the process of acquiring and manipulating the Garrison Records dataset. 

First, data acquisition combined two interdependent processes:

  1. Scraping data 
  2. Soliciting data

Scraping data entailed exploring web-scraping software and online tools such as Outwit Hub and Dexi. Soliciting data was made possible by a South Korean statute that allows the provision and use of public data held by Korean institutions. While the former succeeded in establishing a sustainable way to scrape and update my database, the latter offered an immediate solution and supplied the dataset currently used for this project.

Second, after acquiring the data, I proceeded with two types of data manipulation:

  1. Cleaning data
  2. Parsing data

As detailed in the section on cleaning data, the provided data came in a format that was anything but easy to work with: XML embedded in Excel. Thus, a program was written to remove all XML tags and clean the dataset. In the second method of data manipulation, rather than removing the XML tags, a parser was built to extract the tagged elements and store them as files for the relational database.
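
The difference between the two manipulations can be illustrated with a minimal Python sketch. It is not the program actually used for this project. The sample cell, the tag names (entry, name, rank), and the function names strip_tags and parse_elements are all hypothetical, since the real structure of the Excel cells is described in the later sections.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical cell contents: the real tag names in the dataset may differ.
SAMPLE_CELL = "<entry><name>Example Name</name><rank>Example Rank</rank></entry>"

# Matches any XML tag, opening or closing.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(cell: str) -> str:
    """Cleaning: drop every XML tag and keep only the text."""
    text = TAG_PATTERN.sub(" ", cell)
    # Collapse the whitespace left behind by the removed tags.
    return " ".join(text.split())


def parse_elements(cell: str) -> dict:
    """Parsing: keep the tags and map each child element name to its text."""
    root = ET.fromstring(cell)
    return {child.tag: (child.text or "").strip() for child in root}


if __name__ == "__main__":
    print(strip_tags(SAMPLE_CELL))
    print(parse_elements(SAMPLE_CELL))
```

Cleaning yields a flat string suited to reading or full-text search, while parsing preserves the field labels carried by the tags, which is what makes the result usable as rows in a relational database.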