This is what I had been looking for a long time now. I have a table where I accumulate analytical data about customers’ usage like storage, bandwidth, licensed users, etc. I get a daily report for each of the server. My goal was to get the most recent data for any customer/server and so I found a way to do that. Thanks to this post. In spite of bookmarking this link, I decided to make notes of how the query for the LEFT JOIN works to get the last set of data for a customer.
Here’s the query I would use to get the latest row for each hostname:
SELECT u.* FROM usage_data u LEFT JOIN usage_data ut ON (u.hostname = ut.hostname AND u.id < ut.id) WHERE ut.id IS NULL ORDER BY u.hostname
Let’s assume we have the following set of data for a customer/hostname – usage_data(id, hostname, report_date, storage, bandwidth, licensed_users ….).
id | hostname | report_date | ..... 1 | customer.domain.com | 03/15/2015 | ..... 2 | customer.domain.com | 03/16/2015 | ..... 3 | customer.domain.com | 03/17/2015 | .....
When I do a simple left join, ON (u.hostname = ut.hostname), I’ll see resultant data with following id mappings:
But when I add another condition, the join looks like ON (u.hostname = ut.hostname AND u.id < ut.id), I’m only mapping rows for which u.id < ut.id. Therefore, you get the following mappings for id in [1, 2]
Now, for the id=3 in table u, there isn’t a row in ut that satisfies the condition u.id < ut.id. But since we’re using a LEFT JOIN, the database engine ensures that there is an entry for each row on the LEFT table. Therefore, the engine returns row for id=3 in table u and replaces by NULL for each value in table ut so our result set looks like the following
That’s what we wanted. A way to identify a row uniquely that is the latest. Now if we add another WHERE clause in the result set, to get only that data that has ut.id as NULL, we will get the row corresponding to the latest id value in table u. Therefore, the LEFT JOIN query makes a little more sense to me now and hopefully to you as well.