Parsing Data
This section outlines the process of building a parser in Python. By parser, I mean a program that identifies patterns in a string of characters and extracts the data embedded in those patterns. In my project, building a parser was a crucial step in making use of the Garrison Records dataset.
The Garrison Records dataset, as provided by the Jangseogak Archives, is already marked up, or tagged, with indices on biographical and spatial metadata. As I discussed in the dataset section, these indices, including but not limited to name, duties, buildings, and geographical location, were the patterns that my parser was designed to identify and extract.
The first step towards building this parser was to generate a comprehensive list of tags from the dataset. Since I was provided with this dataset without much contextual information, this step was necessary to take full advantage of it. Thus, using a program that finds all unique tags in an XML document, I came up with the following list of tags:
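A tag-inventory pass of this kind can be sketched briefly with Python's standard library; the file name `records.xml` below is a hypothetical stand-in, not the dataset's actual name:

```python
# Sketch of a tag-inventory pass over an XML file.
# "records.xml" is an illustrative file name, not the project's actual file.
import xml.etree.ElementTree as ET

def unique_tags(path):
    """Return the set of distinct element names found in an XML file."""
    tree = ET.parse(path)
    # tree.iter() walks every element in document order
    return {elem.tag for elem in tree.iter()}

# Example: unique_tags("records.xml") would yield a set such as
# {"record", "name", "duties", "geo", ...}
```

Collecting the names into a set rather than a list means each tag appears once, which is exactly what a comprehensive tag list needs.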
After identifying these tags, I wrote a Python script to extract the content of each tagged element and save it into a tab-separated values (TSV) file. A TSV file was better suited to my dataset than a comma-separated values (CSV) file, given that the extracted elements were often lists of values separated by commas.
Here is my Python script, embedded as a Jupyter notebook:
While extracting the elements for simple tags such as <duties></duties> or <geo></geo> was relatively easy, there were unforeseen difficulties with tags containing attributes, such as the different types of notes: <note type2="author"></note> and <note typepos="marginal"></note>.
The issue was not only that there were different types of tags, but also that there were multiple layers of nested tags. For the moment, due to limited time and the need for a better-designed database schema for the nested tags, I have decided to extract only the outermost layer of these nested tags.
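Extracting only the outermost layer amounts to flattening an element's text while ignoring its internal tag structure; one way to sketch this, with a made-up nested note as input, is via ElementTree's itertext():

```python
# Sketch of "outermost layer" extraction: take all text under an element,
# discarding the nested tag structure. The sample note is a made-up example.
import xml.etree.ElementTree as ET

def outer_text(elem):
    """Concatenate an element's text and all descendant text."""
    return "".join(elem.itertext()).strip()

note = ET.fromstring("<note>stationed at the <building>south gate</building> of the fortress</note>")
# outer_text(note) -> "stationed at the south gate of the fortress"
```

The nested <building> tag is lost, which is exactly the trade-off described above: the text survives, but the inner markup awaits a better database schema.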
Regarding the issue of different types of notes, I decided to implement a more hard-coded method. Because the comprehensive list of tags was relatively short, a more elegant script that automated the entire process of finding the unique tags and then extracting their elements was unnecessary.
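A hard-coded approach of this sort might look like the following sketch; the attribute names and values mirror the forms quoted above but are assumptions about the markup, and the function name is hypothetical:

```python
# Hard-coded handling of a small, fixed set of note variants. The attribute
# names (type2, typepos) and values follow the forms quoted above and are
# assumptions about the dataset's markup.
import xml.etree.ElementTree as ET

def notes_by_type(record):
    """Collect note text into buckets for each known note variant."""
    result = {"author": [], "marginal": []}
    for note in record.iter("note"):
        text = "".join(note.itertext()).strip()
        if note.get("type2") == "author":
            result["author"].append(text)
        elif note.get("typepos") == "marginal":
            result["marginal"].append(text)
    return result
```

With only a handful of variants, an explicit if/elif chain like this is easier to audit than a general-purpose attribute-discovery routine, which is the trade-off the hard-coded method accepts.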