Data is a pain. There's a ton of it; it's in a billion different formats, and trying to make it behave the way you want so you can find patterns or answers to questions is difficult. But it's still important to figure out how to manage data sets because data analysis helps companies in just about every industry across the globe analyze the past to predict the future and get a leg up on competitors. Data is the “thing” right now.

Of course, that's a very broad interpretation of what one does with large data sets. The reality is that most of us working with data on a daily basis are mired in very specific areas, and we have specific questions to answer. This often means that we start with massive amounts of information and need simple ways to trim it all back so we can focus on managing only the stuff we need. Additionally, we need to make sure we're using repeatable processes — just like every other science out there. If you can only pull something off once, it's not science. You have to be able to replicate what you find, over and over.

I think one of the best ways to handle these challenges is by using Docker. I know you're probably thinking Docker is good for building web apps, but managing data sets? How would that work?

Trust me — it's pretty cool.

In this two-part series, I'm going to present an example that will show you why Docker is great in this kind of scenario. The first post will show you how to get your data ready and into a database (we'll use MongoDB — it's awesome at geospatial references), and how to start your basic analysis. The second post will show you how to visualize the data analysis you perform and how you can then interact with your data in a more intuitive and useful way.

For the purposes of this example, I needed to pick a data set that wasn't ridiculously large, but big enough to be interesting. It needed to be free and open to the public (because, well, obviously). And I wanted the data to lead us to ask natural questions about something everyone understands, at least to a point.

I settled on tornadoes.

To be specific, I took a data set from Census.gov outlining all county borders in the United States, and another set of data from NOAA.gov covering all tornadoes recorded in the United States from 1950 to the present. My objective was this: I wanted to isolate all tornadoes that happened in my home county, Tarrant County (a cute little spot on the edge of Tornado Alley in the Dallas/Ft. Worth area of Texas), and all tornadoes that happened within 100 miles of the center of Tarrant County. This seemed like a good way to start with two sets of unrelated data and find a way to narrow them meaningfully, so both would ultimately contribute to answering the same question.

Let's get started.

Setting Up The Environment

While Shapefiles are the gold standard for obtaining government map data, you have to convert them to either GeoJSON or KML format before they're usable for our purposes. And while you have a few options for converting them, the typical method (directly downloading and compiling the Geospatial Data Abstraction Library, or GDAL) is one I find frustrating and overly complex. Instead, I prefer using a Docker container to deal with the standard tools. Docker is efficient and self-documenting. As an added bonus, if you are using geocoded information for data science, Docker gives you an easy way to reproduce your environment and tidy up data when you're done.

The main tools we will need are:

  • GDAL for conversion
  • mapshaper for simplification
  • jq for processing JSON data
  • (And, of course, Docker. If you don't have Docker installed already, you can find instructions here.)

  • Let's start by cloning (or forking) my GitHub Dockerfile repo and building a few Docker images.

    $ git clone https://github.com/mclose/Dockerfiles.git
    $ cd Dockerfiles
    $ docker build -t gdal gdal
    $ docker build -t mapshaper mapshaper
    $ docker build -t jq jq
    

    The -t flag is the tag used to name the image, and the final parameter to build is the directory where the Dockerfile is located. The build for GDAL is a somewhat lengthy process, so plan on it taking about 10 minutes to complete. (Comparatively, if you were creating the Dockerfile from scratch, it would take a bit longer; you're capitalizing on the work I've already done here. And if you were going about this the old-fashioned way, you might as well set aside at least a day, or more.)

  • Let's verify that it worked. The output of docker images should now look something like this:

    $ docker images
    REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
    jq                  latest              972bde3fb8da        2 seconds ago       10.23 MB
    mapshaper           latest              1b59b6a6f9b3        19 seconds ago      652.7 MB
    gdal                latest              1b657b25df39        2 minutes ago       1.683 GB
    node                4.4.4               1a93433cee73        2 weeks ago         647 MB
    alpine              latest              13e1761bf172        2 weeks ago         4.797 MB
    
  • The last piece of the puzzle is to make sure we can use our tools from the command line as if they were any other binary installed on our system. In my Dockerfiles repo, there is a file called all_aliases. By sourcing this file within your shell, you'll have access to all the Dockerized geocoding tools right from the CLI.

    $ source all_aliases
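
     In case you're wondering what sourcing that file actually does: presumably each entry is a shell alias (or function) that wraps a docker run call and mounts your current directory into the container. Here's a rough sketch of what one of them might look like (this is an illustration only; the real definitions are in all_aliases, and the image name, mount point, and options here are assumptions):

     # Hypothetical sketch of a Dockerized alias; the actual definitions
     # live in the all_aliases file in the repo.
     alias ogr2ogr='docker run --rm -v "$(pwd)":/data -w /data gdal ogr2ogr'

     Because the working directory is mounted into the container, input and output files land right where you'd expect, and nothing extra gets installed on your host.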
    

    Seem too easy? It should — and that's a good thing. Your environment is now ready to convert and process Shapefiles. Let's start working with some data.

Gather Your Data

Normally, the next step would be to gather all your data sources for processing. For this tutorial, the easiest route is to clone or fork the tornadoes-gis repo to get all the data and code samples I've used. If you want to download the data sets directly, I've included my sources at the end of this post.

$ git clone https://github.com/mclose/tornadoes-gis.git
$ cd tornadoes-gis

Convert the Shapefiles to GeoJSON

  1. We can now interact with some of our Shapefiles directly. Let's take a look at the Shapefile with county boundaries specifically.

    $ mkdir counties
    $ cd counties
    $ unzip ../source_data/cb_2015_us_county_500k.zip
    $ ogrinfo -al -so cb_2015_us_county_500k.shp
    
  2. The last command, ogrinfo, is running in a Docker container and will give us a summary of what is in the file. Your output should look something like this:

    INFO: Open of `cb_2015_us_county_500k.shp'
         using driver `ESRI Shapefile' successful.
    [...]
    GEOGCS["GCS_North_American_1983",
       DATUM["North_American_Datum_1983",
           SPHEROID["GRS_1980",6378137,298.257222101]],
       PRIMEM["Greenwich",0],
       UNIT["Degree",0.017453292519943295]]
    STATEFP: String (2.0)
    [...]
    

     I'm highlighting some of the more important parts of the Shapefile output here. The GEOGCS portion describes the coordinate system, or spatial reference system, used in the Shapefile. For the most part, we don't need to worry about the coordinate system of the source file. However, we do need to worry about converting it into something we can use. The coordinate system for GeoJSON, our target, is WGS84.

  3. So our next step is to change the Shapefile coordinate system and format. We'll do this in one step and again use a Dockerized command.

    $ ogr2ogr -f GeoJSON -t_srs EPSG:4326 us_counties.json cb_2015_us_county_500k.shp
    
  4. ogr2ogr is a very flexible command and can do many things, but it can be a little complicated. Let's take a second to run through what is going on above in detail.

    • The -f flag indicates the output format we want, GeoJSON.
    • The argument -t_srs is the target spatial reference system that you want to convert the source file to.
    • While you will usually see WGS84, ogr2ogr uses special codes for the reference system which can be looked up here.
    • The code for WGS84 is EPSG:4326.
    • And in another twist on how we usually do things, the output, us_counties.json, is listed before the source, cb_2015_us_county_500k.shp.
  5. Now that we have the US county boundaries converted to GeoJSON, we need to tidy up the file a bit. Since we plan on using MongoDB, we need to make sure each GeoJSON feature is smaller than 16MB, or we will exceed the per-document size limitation. There are a number of ways we could get this done. Two simple ones would be either reducing the number of coordinates inside a county boundary or reducing the precision of the coordinates. In either case, our Dockerized tool mapshaper is the right choice for the job. (We'll double-check the result against that 16MB limit right after this list.)

    For my example, I'm simply going to reduce the precision of the coordinates. So a coordinate like [-86.12254952, 31.6125044] will become [-86.125, 31.615]. If you're worried about loss of accuracy, the change isn't very significant when you're working with objects the size of a US county. The worst case scenario for a continental US county is that the new coordinate will be a quarter mile away from the actual one. For our purposes, that's just fine. Here's the command:

    $ mapshaper -i us_counties.json -o us_counties_mapshaper.json precision=0.005
    
  6. We have one last step before we can import the GeoJSON file into MongoDB. Normal GeoJSON format looks like this:

    { "type": "FeatureCollection",
     "features": [
       { county1 },
       { county2 }
     ]
    }
    
  7. While MongoDB has native support for GeoJSON, mongoimport will interpret the normal GeoJSON file as a single JSON document. That's not what we want. What we really need is to have each county as a JSON document. The easiest way to get this done is with one of my favorite JSON tools, jq. We'll use our Dockerized version of jq.

    $ jq --compact-output ".features[]" us_counties_mapshaper.json > us_counties_jq.json
    

    The parameter .features[] is a query to jq and will extract each county from the features array in the original file.
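
Before importing, it's worth a quick sanity check on the one-feature-per-line file we just created: each line should be a single county, and each county's properties should carry fields like NAME.

    $ wc -l us_counties_jq.json
    $ jq '.properties.NAME' us_counties_jq.json | head -n 5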
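
And since the whole point of the mapshaper pass in step 5 was to stay under MongoDB's per-document limit, you can also confirm that no single county comes anywhere near it. Each line of us_counties_jq.json is one county document, so the longest line is roughly the size in bytes of the largest document; this one-liner (using standard shell tools) should print a number comfortably below 16 MB.

    $ awk '{ print length($0) }' us_counties_jq.json | sort -n | tail -1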

Import into MongoDB

It's finally time to import our file into MongoDB using mongoimport. If you don't have MongoDB installed, please take a look at these instructions to do so. In the command below, the target database is named tornadoes and the collection name will be counties. If the database doesn't exist, MongoDB will automatically create it.

   $ mongoimport --db tornadoes -c counties --drop --file us_counties_jq.json
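
A quick way to confirm the import landed where you expect is to count the documents in the new collection; the count should match the number of lines in us_counties_jq.json.

    $ mongo --quiet tornadoes --eval 'db.counties.count()'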

Import NOAA Historical Tornado Data

  1. What about the tornado data? Let's quickly run through the same steps for getting the NOAA data into MongoDB using our Dockerized commands. Start in the root directory of the GitHub repo.

    $ mkdir tornadoes
    $ cd tornadoes
    $ unzip ../source_data/torn.zip
    $ cd torn
    $ ogr2ogr -f GeoJSON -mapFieldType Date=String -t_srs EPSG:4326 tornadoes.json torn.shp
    $ jq --compact-output ".features[]" tornadoes.json > tornadoes_jq.json
    $ mongoimport --db tornadoes -c tornadoes --drop --file tornadoes_jq.json
    
  2. Since we're working with fairly compact JSON objects in the tornado file, we can skip the mapshaper step. However, we do need to make one change in MongoDB so that our geospatial queries work: the tornadoes collection needs a 2dsphere index on its geometry key. To do this, we'll just use the MongoDB shell.

     $ mongo
     MongoDB shell version: 3.2.6
    connecting to: test
    > use tornadoes;
    switched to db tornadoes
    > db.tornadoes.ensureIndex( { geometry: "2dsphere" } );
    {
       "createdCollectionAutomatically" : false,
       "numIndexesBefore" : 1,
       "numIndexesAfter" : 2,
       "ok" : 1
    }
    > quit();
    

     Without this index, MongoDB won't be able to properly determine distances from a point. This is particularly important if you want to find out how many tornadoes have occurred within a set radius of a point on the map.

     Note: You might be wondering why I didn't do this for the US county data. I have a good reason. Boundaries in Shapefiles are called polygons, and most Shapefiles, when converted to GeoJSON, contain self-intersecting polygons. Basically, the boundary crosses itself, kind of like a bow tie. This is not allowed by the GeoJSON specification. If you were to set up a MongoDB collection with a 2dsphere geometry and try to import a self-intersecting polygon, it would fail. I have yet to find the perfect tool to clean up these situations. (If you know of one, please leave me a comment below.) So for now, I do not set a 2dsphere geometry on large data sets like the US county boundaries.
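
Before moving on, one quick way to confirm the 2dsphere index from the shell session above is actually in place, without opening an interactive shell:

    $ mongo --quiet tornadoes --eval 'printjson(db.tornadoes.getIndexes())'

You should see an index whose key is { "geometry" : "2dsphere" } alongside the default _id index.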

Run the Queries

  1. We are on the home stretch. Let's run some interesting queries on the data we've just imported. First, let's find out how many tornadoes have happened in Tarrant, TX, since 1950. As I said earlier, Tarrant is my home county — so that's why I find the boundary interesting. The MongoDB query will look like this:

    printjsononeline(
       db.tornadoes.find(
         {
           geometry : {
             $geoWithin : {
               $geometry :
                 db.counties.findOne(
                   {
                     "properties.NAME": "Tarrant",
                     "properties.STATEFP": "48"
                   }).geometry
             }
           }
         },
         {
           _id: 0
         }).toArray()
       );
    
  2. The basic idea here is that we're querying for any tornado in the tornadoes collection that falls within the boundary of Tarrant, TX. The output is printed on a single line (printjsononeline) as a JSON array (toArray()). I also don't want the MongoDB document ID returned, so I've added { _id: 0 } as a projection.

  3. If you want to change counties to the one where you live, you'll need your state FIPS code. You can look that up on this page, or pull it straight from the counties collection (see the example after this list).

  4. Here's how you run the query from the command line:

    $ mongo --quiet tornadoes query_tarrant_tornadoes.js | jq --compact-output '{type: "FeatureCollection", "features": .}' > result_tarrant_tornadoes.json
    

     The above query is stored in the file query_tarrant_tornadoes.js. The output from MongoDB is piped to our Dockerized jq, which wraps the array of tornadoes in a FeatureCollection so the result is proper GeoJSON.

  5. Let's do one more quick example to illustrate a slightly different query in MongoDB. Say I want to find all the historical tornadoes within 100 miles of a given point. I'm picking the geographic center of Tarrant, TX, for this example. Let's start with the query:

    printjsononeline(
       db.tornadoes.find(
         {
           geometry : {
             $near : {
               $geometry : {
                 type : "Point",
                 coordinates : [ -97.25, 32.75 ]
               },
               $maxDistance: 160934
             }
           }
         },
         {
           _id: 0
         }).toArray()
       );
    

     The main differences to note here are, first, that we use the $near query. For more information, check out the MongoDB documentation. Second, instead of finding tornadoes within a county boundary, we want to find all tornadoes near a point. The $maxDistance here is in meters, and 100 miles is about 160,934 meters.

  6. Running the query happens exactly like we did above.

    $ mongo --quiet tornadoes query_tarrant_100mi_tornadoes.js | jq --compact-output '{type: "FeatureCollection", "features": .}' > result_tarrant_100mi_tornadoes.json
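
With both result files written, a quick way to see how many tornadoes each query returned is to count the features with jq:

    $ jq '.features | length' result_tarrant_tornadoes.json
    $ jq '.features | length' result_tarrant_100mi_tornadoes.json

As you'd expect, the 100-mile query should return quite a few more tornadoes than the county-only query.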
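
And if you want to adapt these queries to your own county (step 3 above), the NAME and STATEFP values are already sitting in the counties collection, so you can confirm them there instead of hunting through the source data. For example, here's how to check the values used for the county in this post; swap in your own county name:

    $ mongo --quiet tornadoes --eval 'printjson(db.counties.findOne({ "properties.NAME": "Tarrant" }, { "properties.NAME": 1, "properties.STATEFP": 1, _id: 0 }))'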
    

Now what?

This is just the first part of our process. We've got the data cleaned up and ready to use. Next, we need to visualize it. That will be the focus of Part Two in this post series. You might think it's easiest to just paste everything into Google My Maps and call it a day. But I've got a better (and much more useful) idea. We'll be looking to Docker again to create our own Node.js app and then view and even interact with our data in a browser. Stay tuned — and in the meantime, post your comments and questions below.

If you've found that I'm discussing tools or processes you're not familiar with, please don't hesitate to ask how to do something. I'm moving quickly through a few things here on the assumption of some prior experience, but this is all stuff anyone can learn with the right preparation.

Thanks for reading,

Matthew Close - Security Engineer


Sources