A tremendous amount of data is kept as CSVs, or Comma-Separated Values. Programs like Google Spreadsheets let you perform pretty magical calculations on them, but as a spreadsheet grows, it grinds your machine to a halt. Nevertheless, the ubiquity of the CSV means even tremendous files are kept that way. How can we make them easier to process?
Enter orc-csv, a command-line tool for uploading CSVs to Orchestrate, bundled with a web server to work with them from your local machine. We’ll use it to explore top Reddit posts.
To follow along, you’ll need to install Node.js. Once you have it, in a terminal, run this:
sudo npm install -g orc-csv
Then, if you haven’t already, sign up for Orchestrate! Once you’re logged in, create an application. Then, with any CSV, you can do this:
cat path/to/file.csv | orc-csv -u YOUR_API_KEY -c COLLECTION_NAME
… where YOUR_API_KEY is the API key of the application you just created, and COLLECTION_NAME is the name of the collection you want to group this data under.
For example, to upload the top posts from r/programming:
CSV_URL=https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/programming.csv
curl "$CSV_URL" | orc-csv -u YOUR_API_KEY -c r-programming
That CSV contains one thousand records, so it’ll take a moment to upload. Enjoy the pause with Ingrid Michaelson’s Girls Chase Boys :D
Once it’s done, that’s it. You’ve turned a CSV into a searchable, workable API.
Exploring the Data
We can also use orc-csv to explore the data right from the browser, like this:
orc-csv server -u YOUR_API_KEY # Listening on port 3000
From your browser, you can visit http://localhost:3000/v0/r-programming, and you’ll see the first page of items in that collection. orc-csv server uses your API key to create an authenticated proxy to Orchestrate, so you can work with the data without having to put in your credentials. It’s a lot like the Orchestrate dashboard but handier if you’re used to working on the command line.
For example, we can use jq to process the JSON results of searches, like so:
curl "http://localhost:3000/v0/r-programming?" | jq '.results[].value.title'
… which yields a list like this:
"New approach to China"
"\"The idea that I can be presented with a problem, set out to logically solve it with the tools at hand, and wind up with a program that could not be legally used because someone else followed the same logical steps some years ago and filed for a patent on it is horrifying.\" John Carmack"
"Vote for Barbie to be a computer engineer!"
"An absolutely brilliant analogy as to why software development task estimations are regularly off by a factor of 2-3"
"Breaking down Amazon’s mega dropdown"
"Google Officially Announces Chrome OS"
"Dialup handshake explained"
"Simulating cloth"
"\"Dad? Why do we always use .NET?\" — I’m not a big fan of Java, but this movie trailer is brilliant!"
"R.I.P. John McCarthy, father of AI, inventor of Lisp, suddenly at home last night."
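If you want to see what that jq filter is doing without a running server, here is a small sketch you can paste into a terminal. It assumes the listing has a results array whose items each carry the uploaded row under value, which is the shape the filter above relies on. The titles helper and the SAMPLE document are made up for illustration, and the sample is trimmed down to just the fields the filter touches.

```shell
# titles: read an Orchestrate-style listing on stdin and print one title per line.
# The -r flag makes jq print raw strings instead of JSON-quoted ones.
titles() {
  jq -r '.results[].value.title'
}

# A hand-written, trimmed-down stand-in for a real response (illustrative only).
SAMPLE='{"count":2,"results":[{"value":{"title":"Simulating cloth"}},{"value":{"title":"Dialup handshake explained"}}]}'
```

Running `printf '%s\n' "$SAMPLE" | titles` prints the two titles, one per line. Against the local server, the same helper would sit at the end of the curl pipeline in place of the bare jq call.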
As a next step, consider uploading each of the CSVs from the reddit-top-2.5-million project into the same collection using orc-csv. Even just using command-line tools like curl and jq, you can effortlessly extract meaningful data.
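One way to approach that next step is a small loop around the same curl and orc-csv pipeline used earlier. This is only a sketch: it assumes every file in the project follows the same NAME.csv pattern as programming.csv, and the csv_url and upload_all helpers, along with the ORCHESTRATE_API_KEY variable, are names invented here rather than anything orc-csv provides.

```shell
# Base URL of the data directory, taken from the programming.csv URL used earlier.
BASE_URL=https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data

# csv_url NAME: print the URL of one subreddit's CSV.
# Assumes each file is named NAME.csv, like programming.csv.
csv_url() {
  printf '%s/%s.csv\n' "$BASE_URL" "$1"
}

# upload_all COLLECTION NAME...: stream each CSV into the same collection.
# Reads the API key from ORCHESTRATE_API_KEY (a hypothetical variable name).
upload_all() {
  collection=$1
  shift
  for name in "$@"; do
    # -f stops curl from piping an HTTP error page into orc-csv.
    curl -fsS "$(csv_url "$name")" | orc-csv -u "$ORCHESTRATE_API_KEY" -c "$collection"
  done
}
```

With the key exported, `upload_all reddit-top programming` would upload one file into a collection called reddit-top (a placeholder name). Add more subreddit names as extra arguments, after checking the project's data directory for the exact file names.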