I have been playing with the National Bridge Inventory (NBI) data over the past few weeks and discovered a few things about it. It all started when I imported the Texas data and noticed a number of locations with two map markers. After some digging, I found many cases of duplicated data, though only a few fields were actually duplicated. For example, the inspectors may have assigned multiple ID numbers for different aspects of a crossing while using the same geographic coordinates. These duplications do provide more comprehensive information, but they only confuse the issue when we are attempting to build a general overview.
Eliminating these duplicates and cleaning out partial records are fairly simple database tasks. But I have been working on learning Python, so I thought this would be a great opportunity to pick up a few new skills. I wrote a small program which opens the source data file, sifts through it to remove duplicates and empty data strings, then saves the results to a new file. The program also converts the source data from fixed-width to comma-delimited to better support importing it into Google Fusion Tables.
I plan to tweak the program some to improve the data conversion and start posting other states as I get them processed. I also want to build in-frame legends. For now, take a look at the Texas data below (subject to revision as time allows).
The logic for the processing follows:
Open the control file (this file contains the field names and their character positions, e.g. Latitude, 341:348)
Read the field names from the control file and copy them to an array
Ask the user which fields to import and update the array to keep only those, retaining both the field name and its character positions.
Create a new data file for the output data
Open the data file
Loop through each line of the data file
For each line in the data file, loop through the control array and verify each piece of data. Some things to remember:
Latitudes and Longitudes in these files are provided in DMS format and need to be converted to decimal degrees.
Duplicate positions are not wanted – keep a copy of each position for comparison to the next. Throw away any duplicates. You can also do this later by looking for and keeping only unique values.
Some of the lines have empty data strings (especially the rating values). These may be either discarded or marked with a special character (e.g. -999) so they are easily recognized later. Remember that zero has a meaning in this data.
Build up a temporary string with each field read.
Write the temporary string to the output file.
Loop to the next line in the data file.
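The steps above can be sketched in Python roughly as follows. Everything specific here is an assumption for illustration: the control-file format ("name, start:end" with 1-based inclusive positions), the toy field layout (the real NBI file puts Latitude at 341:348), and the packed "DDMMSS.SS" coordinate encoding should all be checked against the actual files and coding guide.

```python
import csv
import io

# Hypothetical control-file lines: "FieldName, start:end" with 1-based,
# inclusive character positions. This toy layout keeps the example short;
# the real NBI positions (e.g. Latitude at 341:348) differ.
CONTROL_LINES = [
    "StructureID, 1:5",
    "Latitude, 6:14",
    "Longitude, 15:24",
    "Rating, 25:26",
]

def parse_control(lines):
    """Turn control-file lines into (name, slice_start, slice_end) tuples."""
    fields = []
    for line in lines:
        name, span = (part.strip() for part in line.split(","))
        start, end = span.split(":")
        fields.append((name, int(start) - 1, int(end)))  # 1-based -> slice
    return fields

def dms_to_decimal(raw):
    """Convert packed DMS like '293536.12' (29deg 35min 36.12sec) to
    decimal degrees. The 'SS.SS' seconds tail is an assumed encoding."""
    raw = raw.strip()
    if not raw:
        return None
    seconds = float(raw[-5:])
    minutes = int(raw[-7:-5])
    degrees = int(raw[:-7])
    return degrees + minutes / 60 + seconds / 3600

def process(data_lines, fields):
    """Slice each fixed-width line into fields, convert coordinates,
    drop duplicate positions, and flag empty values with -999."""
    seen = set()
    rows = []
    for line in data_lines:
        record = {}
        for name, start, end in fields:
            value = line[start:end].strip()
            if name in ("Latitude", "Longitude"):
                value = dms_to_decimal(value)
            # zero is meaningful in this data, so mark empties with a sentinel
            record[name] = -999 if value in ("", None) else value
        position = (record["Latitude"], record["Longitude"])
        if position in seen:  # duplicate coordinates: throw away
            continue
        seen.add(position)
        rows.append(record)
    return rows

fields = parse_control(CONTROL_LINES)
data = [
    "A0001293536.120953812.0045",
    "A0002293536.120953812.0067",  # same position as above -> dropped
    "B0001300000.001000000.00  ",  # empty rating -> -999
]
rows = process(data, fields)

# write comma-delimited output, as the program does for Fusion Tables
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[name for name, _, _ in fields])
writer.writeheader()
writer.writerows(rows)
```

Keeping the positions already seen in a set makes the duplicate check cheap even for a big state file, and it catches duplicates anywhere in the file rather than only on adjacent lines.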
I set up my program to loop through and convert every data file in a specified folder. Texas and Louisiana each took about an hour to process. I haven't yet written the part of the program that runs the spatial join to add the county data to each file. That comes next.
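The folder loop itself can be as simple as the sketch below. The `.txt`/`.csv` suffixes and the per-line conversion callback are assumptions, not the actual program; the demo runs against a throwaway temporary folder.

```python
import tempfile
from pathlib import Path

def convert_folder(folder, convert_line):
    """Process every fixed-width .txt file in `folder`, writing a .csv
    beside each one. File suffixes are assumptions about the naming."""
    converted = []
    for source in sorted(Path(folder).glob("*.txt")):
        target = source.with_suffix(".csv")
        with source.open() as src, target.open("w") as dst:
            for line in src:
                dst.write(convert_line(line.rstrip("\n")) + "\n")
        converted.append(target.name)
    return converted

# quick demo in a throwaway folder, with a trivial stand-in converter
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "tx.txt").write_text("abc\n")
    done = convert_folder(folder, lambda line: ",".join(line))
    output = Path(folder, "tx.csv").read_text()
```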
When I have a good weekend where I don’t have to work on other projects, I’ll set up another computer to crank through the rest of the states.