Blog of Jean-Daniel Fekete

1.  Managing Historical Dates in Various Languages

July 1st, 2015

For the Cendari project, I have to manage large amounts of historical data with dates. This may seem easy, since all modern programming languages come with extensive support for dates, but it turns out to be quite complicated because the implementations of date objects are too limited.

This is true for Python and JavaScript. I have not checked other languages yet, but serious Java programs use the Joda-Time library instead of the standard date and time classes.

What is the problem? When dealing with dates before 1970, strange things happen:

  • Python does not support dates before Jan. 1st, 1 (beginning of the Christian Era). It just does not, and raises a ValueError when trying to create a date whose year is below 1 (see the snippet after this list).
  • JavaScript can manage dates BC, but automatically interprets years between 0 and 99 as implicitly in the 20th century, so new Date(1, 0, 1) (the month argument is 0-based) creates a date of Jan. 1st, 1901.
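To make the Python limitation concrete, here is a minimal session (datetime.MINYEAR is the hard lower bound):

    from datetime import MINYEAR, date

    print(MINYEAR)        # 1: the earliest year the datetime module accepts
    print(date(1, 1, 1))  # 0001-01-01, the minimum representable date
    date(0, 1, 1)         # raises ValueError: year 0 is out of range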

So, what can be done? In Python, I had to re-implement some functions to manage dates "by hand". In JavaScript, I had to explicitly call date.setFullYear(1) to reach year 1. Even copying a correct date with year==1 into a new date reproduces the 1901 bug.

To implement the date functions correctly, I took inspiration from Fourmilab's calendar converter. It is a JavaScript program that contains conversions between many calendars, along with useful functions to compute the number of days to some reference "Epoch" for all these calendars. This is what I used to compute dates in Python.
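For the curious, here is a Python sketch of the Gregorian-to-Julian-day conversion, transcribed from Fourmilab's JavaScript (the test value is the well-known Julian day of 2000-01-01 at midnight):

    from math import floor

    GREGORIAN_EPOCH = 1721425.5  # Julian day of 0001-01-01, proleptic Gregorian

    def leap_gregorian(year):
        """Is `year` a leap year in the (proleptic) Gregorian calendar?"""
        return year % 4 == 0 and not (year % 100 == 0 and year % 400 != 0)

    def gregorian_to_jd(year, month, day):
        """Julian day number of a proleptic Gregorian date; years <= 0 work too."""
        return (GREGORIAN_EPOCH - 1
                + 365 * (year - 1)
                + floor((year - 1) / 4)
                - floor((year - 1) / 100)
                + floor((year - 1) / 400)
                + floor((367 * month - 362) / 12
                        + (0 if month <= 2 else -1 if leap_gregorian(year) else -2)
                        + day))

    print(gregorian_to_jd(2000, 1, 1))  # 2451544.5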

Searching for other solutions, I also found the astropy package, but it is huge and requires additional packages such as numpy, when in the end I only needed two functions. Still, the astropy.time package is amazingly extensive.

I wonder why this is not solved once and for all in all the programming languages. Dates are well understood, and I see no excuse for shipping these incomplete, buggy implementations in the mainstream languages.

2.  ElasticSearch and TimeLines Visualization

July 1st, 2015

ElasticSearch is a very nice NoSQL database aimed at full-text indexing, very fast and scalable queries, and powerful aggregation functions. It is almost perfect for visualization, but it can be tricky. I have been tricked many times, so I am writing this down to keep a record of it.

2.1  Adjusting resolution for Maps and Time Histograms

I have a huge database of documents containing entities, such as persons, places, events/dates, etc. I want to provide a faceted browsing interface for searching them. To test its scalability, I have loaded the whole English Wikipedia using DBpedia. So, for each article, I have a list of persons, places, and dates. ElasticSearch can aggregate them, and that's great!

For places (locations), if I have the geographical coordinates, it returns a list of points aggregated on a grid over the map. The resolution of the grid is specified through the GeoHash grid aggregation, which takes a precision value between 1 and 12: 1 provides a grid resolution of about 5000km, and 12 of about 3cm.
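For reference, the request body looks like this; the index and field names are made up for the example, but geohash_grid and its precision parameter are the actual ElasticSearch vocabulary:

    # Request body, as a Python dict, for a hypothetical "wikipedia" index
    # with a geo_point field named "location".
    grid_query = {
        "size": 0,  # we only want the aggregation, not the matching documents
        "query": {"match": {"text": "Paris"}},
        "aggs": {
            "grid": {
                "geohash_grid": {
                    "field": "location",
                    "precision": 5  # 1 (~5000km cells) up to 12 (~3cm cells)
                }
            }
        }
    }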

Now, when someone requests all the documents about "Paris", what precision value should I ask for? Well, there is no way to know in advance. You have to issue the query once, asking first for the geographical bounding box (an aggregation option), compute the precision from it, and then perform a second query asking for the results of the grid aggregation. It may sound complicated and expensive but, actually, ElasticSearch caches the results of its search filters, so the second query is extremely fast.
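Here is a minimal sketch of the two-phase scheme with the official elasticsearch Python client (body-style calls, as in the client versions of that era); the index and field names are hypothetical, geo_bounds is the real bounding-box aggregation, and the precision heuristic is mine, following the geohash bit layout where longitude gets 3, 2, 3, 2, ... bits per character:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    query = {"match": {"text": "Paris"}}

    # Phase 1: ask only for the bounding box of the matching locations.
    res = es.search(index="wikipedia", body={
        "size": 0,
        "query": query,
        "aggs": {"bbox": {"geo_bounds": {"field": "location"}}},
    })
    bounds = res["aggregations"]["bbox"]["bounds"]
    width = abs(bounds["bottom_right"]["lon"] - bounds["top_left"]["lon"])

    def pick_precision(width_deg, target_cells=20):
        """Smallest precision whose cell width is below width/target_cells degrees."""
        lon_bits = 0
        for p in range(1, 13):
            lon_bits += 3 if p % 2 else 2
            if 360.0 / (1 << lon_bits) <= width_deg / target_cells:
                return p
        return 12

    # Phase 2: same query, with the grid aggregation at the computed precision.
    # ElasticSearch has cached the filter, so this round-trip is cheap.
    res = es.search(index="wikipedia", body={
        "size": 0,
        "query": query,
        "aggs": {"grid": {"geohash_grid": {"field": "location",
                                           "precision": pick_precision(width)}}},
    })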

The same goes for a timeline. You can ask ElasticSearch to return a histogram of documents over time, but you need to specify the size of the histogram buckets. If I know that the dates range from 1776 to 2016 and I want to visualize the histogram with 20 buckets, I need buckets of (2016 - 1776) / 20 = 12 years.

Great, but how do I know the time span of my query? Again, I have to query twice. Not too bad; it works. But then, ElasticSearch tells me that Date Histograms are meant to manage date-related buckets, because dates are not as easy as normal histogram values: years vary in length, etc. So ElasticSearch provides several date units: year, month, week, day, and so on. I therefore specify that the buckets are 12 years long, "12y"... and it returns an error. The only unit that can be multiplied is days; ElasticSearch will not compute multiples of years, months, or weeks. So I ask for round(12*365.25) = 4383 days, which somehow defeats the purpose of ElasticSearch managing leap years. But then it works, or almost.
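The working request ends up looking like this, still with hypothetical index and field names (a note for readers on newer ElasticSearch versions: the interval parameter has since been split into calendar_interval and fixed_interval, and fixed intervals still top out at days):

    # 20 buckets over 1776..2016: (2016 - 1776) / 20 = 12 years per bucket,
    # which has to be expressed in days because "12y" is rejected.
    bucket_days = round(12 * 365.25)  # 4383

    histogram_query = {
        "size": 0,
        "query": {"range": {"date": {"gte": "1776-01-01", "lt": "2016-01-01"}}},
        "aggs": {
            "timeline": {
                "date_histogram": {
                    "field": "date",
                    "interval": "%dd" % bucket_days  # "4383d"
                }
            }
        }
    }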

So I show the whole timeline of Wikipedia articles, ranging from -2333-10-02 to 2015-06-07 (I have excluded dates in the future). When I zoom in, I expect to specify a new date range, filter the documents, and get that range back. So that's what I do: from 0001-01-01 to 1000-01-01...

And I get back a much larger date range.

How come?

After multiple tests to understand, here is what happens: ElasticSearch queries all the documents that contain dates in the specified range; that is the result set. Then, it computes the aggregation: for all these documents, it fills up buckets about 50 years long (the zoomed range divided by 20). But when ElasticSearch performs that aggregation phase, it takes all the dates related to each article, even the ones that fall outside the specified range.

So what should I do then? ElasticSearch will not filter the buckets for me, so I have to filter the buckets myself after I receive them.
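Concretely, the post-filtering is short: date_histogram buckets are keyed by milliseconds since the UNIX epoch, so it is enough to drop the keys outside the requested range. A sketch, reusing the es client from above (to_millis is a small helper written for the example):

    from datetime import datetime, timezone

    def to_millis(iso_date):
        """Milliseconds since the UNIX epoch; negative before 1970."""
        dt = datetime.strptime(iso_date, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)

    # After zooming: query documents with dates in [0001-01-01, 1000-01-01).
    lo, hi = "0001-01-01", "1000-01-01"
    res = es.search(index="wikipedia", body={
        "size": 0,
        "query": {"range": {"date": {"gte": lo, "lt": hi}}},
        "aggs": {"timeline": {"date_histogram": {"field": "date",
                                                 "interval": "18262d"}}},  # ~50 years
    })
    buckets = res["aggregations"]["timeline"]["buckets"]

    # Keep only the buckets inside the requested range; the others come from
    # the additional dates carried by the matching documents.
    lo_ms, hi_ms = to_millis(lo), to_millis(hi)
    visible = [b for b in buckets if lo_ms <= b["key"] < hi_ms]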

This behavior is surprising, but the semantics are right; they are just not what is expected when implementing a timeline visualization. This shows that it will take a bit of time for Visualization and Databases to come to a common ground on what is needed to explore large databases. But the low-level primitives are almost there in ElasticSearch; I know the authors are also writing about what they call Reducers, a post-processing of aggregation results. I think that will solve my problem once it is done.