A number of Orchestrate users have asked for an in-depth guide to using Lucene queries (Lucene Query Parser Syntax) with Orchestrate. To that end, I’ve loaded an Orchestrate application with ~10,000 emails from Enron and built a small AppFog/Orchestrate app for searching it. The AppFog app is available on GitHub. Using the form at the bottom of each section, you can query that app and test your Lucene skills. Feel free to tweak the parameters and see what effect they have. Let’s get started.

About the Dataset

The dataset used is a sampling of emails from Enron’s corporate servers that have been converted to JSON. The original data can be found here. The following is a doc that demonstrates the form of the JSONified emails:

{
  "X-cc": "",
  "From": "[email protected]",
  "X-Folder": "\\MCASH (Non-Privileged)\\Cash, Michelle\\Inbox",
  "Content-Transfer-Encoding": "7bit",
  "X-bcc": "",
  "X-Origin": "Cash-M",
  "To": [
    "[email protected]"
  ],
  "parts": [
    {
      "content": "\r\n=================\r\nMike Baer\r\nAccenture - Legal Group\r\n100 South Wacker Drive, Ste. 515, Chicago, IL 60606-4006\r\nVoice: 312-693-1512, Octel: 69/31512, Fax: 312-652-1512\r\neMail: [email protected]\r\n=================\r\n\r\n\r\n\r\nThis message is for the designated recipient only and may contain\r\nprivileged or confidential information.  If you have received it in error,\r\nplease notify the sender immediately and delete the original.  Any other\r\nuse of the email by you is prohibited.\r\n - B0033CF1.TIF \r\n\r\n",
      "contentType": "text/plain"
    }
  ],
  "X-FileName": "MCASH (Non-Privileged).pst",
  "Mime-Version": "1.0",
  "X-From": "[email protected]@ENRON\r\n <[email protected]>",
  "Date": {
    "$date": 993750113000
  },
  "X-To": "Cash, Michelle </O=ENRON/OU=NA/CN=RECIPIENTS/CN=MCASH>",
  "Message-ID": "<[email protected]>",
  "Content-Type": "text/plain; charset=us-ascii",
  "Subject": "Enron CSA 12/17/97 - Signed"
}

Terms and Phrases

The simplest query is a term query. With the query confidential, the entirety of each document in the corpus is checked for a match. You can match phrases by quoting them, for example: "confidential materials".
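If you are sending queries from code rather than from the form, the phrase needs its quotes preserved and the whole query needs URL-encoding. Here is a minimal Python sketch of that step. The helper names are made up for illustration, and the parameter names "query" and "limit" are assumptions about the search endpoint, so check them against the Orchestrate API docs before relying on them.

```python
from urllib.parse import urlencode


def phrase(text: str) -> str:
    """Wrap text in double quotes so Lucene treats it as a phrase.

    Backslashes and embedded double quotes are escaped first.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def search_params(query: str, limit: int = 10) -> str:
    """Build a URL query string; the parameter names are assumptions."""
    return urlencode({"query": query, "limit": limit})


term_query = "confidential"
phrase_query = phrase("confidential materials")
encoded = search_params(phrase_query)
```

The encoded string can then be appended to whatever search URL your application uses.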


Required and Prohibited Terms

+ and - can be used to require and prohibit terms, respectively. +legal action file suit -california requires that “legal” is present and “california” is not present in matches.
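To make the required/prohibited logic concrete, here is a rough Python sketch of how such a query decides whether a document matches. It is a simplification under stated assumptions: the tokenizer is a naive lowercase split, not the analyzer Orchestrate actually uses, and scoring is ignored entirely. The function names are hypothetical.

```python
import re


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(doc_tokens, required=(), prohibited=(), optional=()):
    """Return True if the token list satisfies the +/- constraints."""
    present = set(doc_tokens)
    # Any prohibited term rules the document out.
    if any(term in present for term in prohibited):
        return False
    # With required terms, all of them must be present.
    if required:
        return all(term in present for term in required)
    # Without required terms, at least one optional term must match.
    return any(term in present for term in optional)


# Mirrors: +legal action file suit -california
rule = dict(required=["legal"], prohibited=["california"],
            optional=["action", "file", "suit"])

texas = matches(tokens("Legal action pending in Texas"), **rule)
calif = matches(tokens("Legal action pending in California"), **rule)
no_legal = matches(tokens("We may file suit"), **rule)
```

In a real engine the optional terms still raise the score of documents that contain them; this sketch only answers the yes/no question.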



Fields

To address the field “From”, simply prefix the search term with “value.From:”. For example, to search for all documents from [email protected] query with value.From:"[email protected]".


Boolean Operators

Lucene supports standard boolean operators AND, OR and NOT. This allows for complex queries, such as value.To:"[email protected]" AND NOT "press release".



Grouping

Grouping with parentheses allows complex boolean statements, like (price AND fixing) OR (market AND manipulation).


Field Grouping

Statements can be grouped, such that a boolean statement can apply to a specific field. For example, value.To:("[email protected]" AND "[email protected]") will return emails addressed to both [email protected] and [email protected].
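Field-grouped queries are easy to get wrong by hand once quoting is involved, so here is a small Python sketch that assembles one. The function name is hypothetical and the example.com addresses are placeholders, not addresses from the dataset.

```python
def field_group(field, values, op="AND"):
    """Build value.<field>:("a" OP "b" ...) from a list of values."""
    if op not in ("AND", "OR"):
        raise ValueError("op must be AND or OR")
    quoted = []
    for value in values:
        # Escape backslashes and quotes, then wrap the value as a phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return "value." + field + ":(" + (" " + op + " ").join(quoted) + ")"


both = field_group("To", ["alice@example.com", "bob@example.com"])
either = field_group("To", ["alice@example.com", "bob@example.com"], op="OR")
```

Here both holds value.To:("alice@example.com" AND "bob@example.com"), and either holds the same query with OR.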



Wildcards

Lucene offers two wildcards: ? and *. ? matches any single character; * matches any number of characters, including none. For example, value.Subject:?enn* matches any email whose subject contains a word that starts with any single character, followed by “enn” and any suffix.
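If you want a feel for what ?enn* will and won't catch before running it, Python's fnmatch module uses the same meaning for ? and *. The sketch below applies the pattern word by word. It assumes a naive lowercase tokenizer rather than Orchestrate's real analyzer, and note that fnmatch also gives special meaning to [ ], which Lucene wildcards do not.

```python
import re
from fnmatch import fnmatchcase


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def subject_matches(subject, pattern):
    """True if any single word of the subject matches the wildcard."""
    return any(fnmatchcase(tok, pattern) for tok in tokens(subject))


hit = subject_matches("Kenneth Lay resignation", "?enn*")
also_hit = subject_matches("Tennessee pipeline update", "?enn*")
miss = subject_matches("Enron CSA 12/17/97 - Signed", "?enn*")
```

The last subject misses because "enron" has no "enn" starting at its second character.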


Regular Expressions

Regular expressions offer a powerful way to extract matches. While Lucene’s regular expression support is not Perl-compatible, it offers many of the same features. To match any recipient named Catherine or Katherine, use value.To:/(c|k)atherine/.
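One thing worth knowing is that a Lucene regular expression has to match an entire indexed term, not just part of it. Python's re.fullmatch behaves the same way, so it makes a handy sketch. The list of terms below is invented for illustration; how Orchestrate actually splits email addresses into terms is not covered here.

```python
import re

# Same pattern as the query, without the surrounding slashes.
NAME = re.compile(r"(c|k)atherine")


def term_matches(term):
    """Anchored match: the whole term must fit the pattern."""
    return NAME.fullmatch(term) is not None


terms = ["catherine", "katherine", "kathy", "catherines"]
matched = [t for t in terms if term_matches(t)]
```

Only "catherine" and "katherine" survive; "catherines" would need a pattern like (c|k)atherine.* to match.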


Fuzzy Search

Fuzzy search allows terms within a given number of edits (default 2) to count as matches. For example, value.parts.content:(california manipulate~4) will match emails that contain California and words within 4 edits of “manipulate”, including manipulated, manipulating, manipulation, and manipulator.
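An "edit" here is a single-character insertion, deletion, or substitution. To see how far each of those variants is from “manipulate”, here is a plain Levenshtein distance in Python. Treat it as an approximation: Lucene's own fuzzy matching may also count a swap of two adjacent characters as one edit, which this sketch does not.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


variants = ["manipulated", "manipulating", "manipulation", "manipulator"]
distances = {word: edit_distance("manipulate", word) for word in variants}
```

All four variants come out at 3 edits or fewer, which is why they fall inside the ~4 in the example.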


Proximity Search

With proximity search, the maximum distance between terms can be specified. For example, value.parts.content:"prices manipulate"~4 will search for emails with “prices” and “manipulate” within 4 terms of each other.
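Here is a simplified Python sketch of the idea: find the positions of both words and check whether any pair is close enough. It assumes a naive lowercase tokenizer and measures distance as the difference in word positions. Lucene's real "slop" calculation counts position moves and can differ at the boundaries, so use this for intuition only.

```python
import re


def tokens(text):
    """Naive tokenizer: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(text, first, second, max_distance):
    """True if the two words occur within max_distance positions."""
    toks = tokens(text)
    pos_first = [i for i, t in enumerate(toks) if t == first]
    pos_second = [i for i, t in enumerate(toks) if t == second]
    return any(abs(i - j) <= max_distance
               for i in pos_first for j in pos_second)


near = within("Prices were easy to manipulate", "prices", "manipulate", 4)
far = within("Prices rose, and nobody could prove anyone tried to manipulate",
             "prices", "manipulate", 4)
```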



Boosting

All search results are scored based on relevancy. Orchestrate returns the highest scoring results first. Lucene provides ways to “boost” specific terms so that you can value some terms more highly than others. value.parts.content:(conspiracy^3 california^2 market) weights “conspiracy” by a factor of 3 and “california” by a factor of 2, leaving “market” at the default weight.



Ranges

Querying ranges is simple with Lucene. To search for all emails sent on 2001-06-28, the following query could be used based on Unix epoch milliseconds: value.Date.$date:[993682800000 TO 993769200000}. The [ indicates that the start of the range is inclusive of the specified value, whereas } indicates that the end of the range should not be matched.
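Working out epoch milliseconds by hand is error-prone, so here is a short Python sketch that builds a one-day range query. The function name and the utc_offset_hours parameter are hypothetical. As far as I can tell, the start value in the example above corresponds to midnight at UTC+1, so that is the offset used below; with an offset of 0 you get UTC midnight instead.

```python
from datetime import datetime, timedelta, timezone


def day_range_query(year, month, day, utc_offset_hours=0,
                    field="value.Date.$date"):
    """Build field:[start TO end} covering one calendar day.

    start is inclusive and end (the next midnight) is exclusive.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = datetime(year, month, day, tzinfo=tz)
    end = start + timedelta(days=1)
    start_ms = round(start.timestamp() * 1000)
    end_ms = round(end.timestamp() * 1000)
    return f"{field}:[{start_ms} TO {end_ms}}}"


plus_one = day_range_query(2001, 6, 28, utc_offset_hours=1)
utc = day_range_query(2001, 6, 28)
```

The two bounds are always exactly 86,400,000 milliseconds apart, which is a quick sanity check for any hand-written range.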



Choose your own adventure

That’s the end of the guided part of the tour. From here on out, the journey is in your hands. Safe travels!