Finding the most recent item by group in Rails

By Stephen Karger on Jun 21, 2019 9:09:03 AM
A certain kind of question comes up sometimes when building an application with a relational database:
What is the most recent item for each category?
It often happens with a one-to-many table relationship. You might have a departments table and an employees table, where a department has many employees, and you would like to know each department's most recently hired employee.
The general version of this question even has a name, Greatest N Per Group, and it's not limited to one-to-many relationships, nor to most recent items. Nevertheless, that example is broadly useful because it arises naturally when inserting records into a DB table over time. In this post I'll show the query using standard SQL and walk through integrating it into Rails. Read on for details!

Example Domain Model

Let's say we're building an application for people training for triathlons to help them log their workouts. A triathlon is a race with three stages: swimming, then cycling, then running. To prepare for their race, our users practice each sport and record their workouts.
The application has two tables, sports and workouts, along with two corresponding ActiveRecord models.
Our sports table has three rows. These records represent the "categories," aka the groups.
ID    NAME       TIPS
1     swimming   Swim in open water when you can.
2     cycling    Know how to fix a flat tire.
3     running    Find shoes that fit well.
To keep this example simple, we only have three rows. In real applications the categories table often has many more, which provides more motivation to consolidate the related-items lookup into a single query.
Then we have a workouts table where each record includes a reference to its sport. The single user we have right now has done five workouts:
1     1        2          First bike ride   2019-06-15 10:15:00
2     1        2          Second bike ride  2019-06-16 14:30:00
3     1        1          First swim        2019-06-18 18:00:00
4     1        1          Second swim       2019-06-20 17:45:00
5     1        3          First run         2019-06-21 07:00:00
We're assuming that the standard Rails created_at timestamp represents when the workout happened in real life. We might want a separate timestamp column, in order to allow users to specify a past or future time relative to the moment they log it. That quirk only applies to domains where there's a meaningful time value besides the moment it's entered into DB.
Now we can ask our big question:
For each triathlon sport, what the most recent workouts User 1 completed?

Inspecting the data, we can see that the answer is these three records:
2     1        2          Second bike ride  2019-06-16 14:30:00
4     1        1          Second swim       2019-06-20 17:45:00
5     1        3          First run         2019-06-21 07:00:00

Writing the SQL query

We can find the same answer in SQL with a query that groups by sport type, orders by when the workout was recorded, and joins back to workouts to choose the record with the latest timestamp:

There are other ways to write this query. In Postgres you might want to replace the GROUP BY and JOIN with a DISTINCT ON expression, or use a window function like row_number(). You may want to experiment to find the query that yields the best performance with your data.

Another subtle point: the query above joins on created_at. In theory, you can have more than one row in workouts for a given sport_id with the same created_at values, which will cause this query to return multiple most recent workouts for that sport, violating the application's expectations.

Depending on your application, that situation may be impossible in practice. If it can happen, you need to refine the query to ensure you obtain a single most recent related item for each category. Unfortunately, when limiting ourselves to standard SQL, the more robust query becomes much harder to read:

Database-specific features really start to shine here. The Postgres version below obtains the same desired results. Look how much more concise it is!

For the rest of this example, we will assume that duplicate created_at values are impossible due to how our application works, so we can stick to the simpler standard SQL version.

Integrating the query into Rails

We want to use the results of this query in our Rails app, and write code like the following:

# run the SQL query above, load the results into Workout model instances
# query sports, and eager load their associated most recent workouts Sport .where(name: ['swimming','cycling'], workouts: { user_id: 1 }) .includes(:most_recent_workout)
How do we plug the SQL query into our ActiveRecord models?

Scope for a custom query

We start by defining a scope in Workout, which is the "many" model in this example. We utilize the ActiveRecord from method to integrate our custom SQL.

There's an important detail here: We must wrap the plain query inside a ( ... ) workouts expression. This makes it so that the generated SQL will match up correctly with normal ActiveRecord queries.

With this in place, we now have enough to put the Workout.most_recent_by_sport expression in our app.

Association for a custom query

On the other side, in our Sport model, we also want to use this query when loading associated workouts. For normal Rails associations, we get association methods for free. For example, since

Sport has_many :workouts

we can write expressions like sport.workouts, or

In particular, we can use built-in association methods to avoid an N + 1 query. Say we have a web page where we display the list of sports along with all of their workouts. The approach below will prompt an N + 1 query:

We can prevent that problem by changing the first line above to Sport.includes(:workouts).all

The includes method will prompt ActiveRecord to load the sports and their associated workouts with only two queries. Side note: If you don't like remembering to eager load associations this way, check out Salsify's goldiloader gem!

It would be ideal to have the same methods available for our "most recent workout by sport" query. We can do that with a slightly more complicated association:

The association leverages ActiveRecord's merge method to bring in our custom query. Equivalently here, we could have interpolated the SQL into a string, like so:

from("(#{Workout.most_recent_by_sport.to_sql}) workouts")

That would have been safe in this case since we fully define that SQL in our source code, but in general SQL generated with string interpolation can create a security vulnerability. Use caution in situations where your query might incorporate user-input.

All set!

With this association in place, we obtain the normal functionality of ActiveRecord associations. Our web page could now display the list of sports along with their most recent respective workouts:

One final performance note: you may need indexes to make queries like this efficient.

Good index design depends on your table's columns, the distribution of record values, and the specific queries your app needs to execute, so I can't offer a universal recommendation.

As a starting point, indexes on the "many" table's category ID and timestamp (sport_id, created_at) may help the performance of the scope query.

Read More →

Ember in 2019: Tearing Down "Us and Them"

By Dan Freeman on Jun 11, 2019 8:30:22 AM

I'm writing this near the end of the official #EmberJS2019 window, which means a lot of what there is to say has already been said. As I've read through this year's posts, there are a handful of themes I see coming up over and over again:

  • streamlining and modularizing the framework
  • broadening the Ember community
  • landing our new build system, Embroider

These points are each important and valuable in their own right, and I'm hopeful that the 2019 Roadmap RFC will address all of them. Many of the posts discussing them, though, brush up against something that I think merits a more explicit discussion.

Over the past couple years I've seen an increasing number of Ember folks display a mentality that divides the world into Us and Them. It manifests in social media interactions and blog posts, day-to-day chatter in the Ember Discord server, and even the way we frame meetup and conference talks. I think it's driven by a desire to see Ember succeed and to convince other people that they should like this thing as much as we do, but it ultimately does everyone involved a disservice.

Read More →

Our Journey from Heroku to Kubernetes

By Salsify DevOps on Dec 20, 2018 2:08:00 PM

The Decision

We, like many small startups, had started our application on Heroku, and had built up considerable technical and social practice around its use. About 18 months ago, the Engineering team here at Salsify recognized that as our team, product, and user base continued to grow, we would outgrow the ideal case for Heroku.

Read More →

A Safer Rails Console

By Timothy Su on Feb 5, 2018 1:49:06 PM

A Safer Rails Console_Salsify.jpeg

One of the blessings and curses of Rails development is the ability to use Rails console for debugging issues and inspecting data. The console is oftentimes used in a production environment, as it is the quickest method to glean information about any problems. With great power comes great responsibility: A command that attempts to reset a local developer environment by deleting all records of a model could easily be input into a production console. In addition, with potentially unknown clipboard data (thanks "copy-on-select"), any valid Ruby code with line breaks will automatically be executed if pasted. Any user who is deleting models, queuing up events, and updating records accidentally or intentionally should be made aware of the implications. At Salsify, we've solved this problem using a combination of open-source and home-brewed improvements and rolled them into a handy gem. Read on to learn more!

Read More →

Fancy trees with botanist

By Keith Williams on Jun 14, 2017 2:21:28 PM

Have you ever wanted to build your own calculator, query language, or even web browser? Parsers and Transformers are tremendously useful for these applications and many more, and thanks to tools like Salsify's own Botanist you don't need to be an expert in compiler design to work with them.

Read More →

Organizing Data Flow with Ember

By Devers Talmage on Apr 11, 2017 1:38:23 PM

One of the most important core principles of developing with Ember is "data down, actions up." You might see this concept abbreviated as DDAU in various Ember communities. The main premise of DDAU is that data should flow down through your component hierarchy (passed through and potentially modified by various components), while changes to said data should be propagated back up through that hierarchy via actions.

Read More →

Adventures in Avro

By Tim Perkins on Jun 13, 2016 8:11:57 AM
As part of our microservices architecture we recently adopted Avro as a data serialization format. In the process of incorporating Avro we created a Ruby DSL for defining Avro schemas, developed a gem for generating models from Avro schemas, and reimplemented an Avro schema registry in Rails. Here’s how we got there ... 
Here’s a situation that may be familiar: at Salsify we are moving towards a microservices architecture. Have you done that too? Are you thinking about doing it? This is a pretty common progression for startups that built a monolithic application first, found great market fit and then need to scale both the application and the team. At Salsify, we already have quite a few services running outside of the original monolith, but several months ago we started to define an architecture for how we should chip away at the monolith and structure new core services.
Naturally part of the architecture we are defining is how services should communicate with each other. For synchronous communication, we decided to stick with HTTP REST APIs that speak JSON. For asynchronous communication, we selected Apache Kafka.
We evaluated several data serialization formats to use with Kafka. The main contenders were Apache Avro, JSON, Protocol Buffers and Apache Thrift. For asynchronous communication we wanted a data format that was more compact and faster to process. Asynchronous data may stick around longer and the same message may be processed by multiple systems so version handling was important. A serialization system should also provide additional benefits like validation and strong typing.
At this point, I should inject that we’re primarily a Ruby shop. We love our expressive, dynamic language of choice, so a big factor in the selection process was how well the framework integrates with Ruby. Based on the title of this post, it's not going to be any surprise which option was the winner ...

Read More →

Delayed Job Queue Fairness

By Robert Kaufman on May 23, 2016 1:06:13 PM
At Salsify, we use Delayed Job extensively to handle asynchronous tasks. This works well for us, as it means we can finish web requests faster, resulting in a more responsive web app, while offloading non-urgent tasks to background jobs. For the most part, Delayed Job (and similar job queuing mechanisms like Resque, Celery, etc.) provide a simple and highly effective approach for running background work, making sure that it gets done, and providing a framework to scale your compute resources to handle expected workloads. Even beyond that, these frameworks create straightforward opportunities for dynamically scaling resources to handle spikes in workload. For example, we use an excellent service called HireFire to dynamically scale our Delayed Job worker pools based on queued work. Meaning, we can meet the needs of changing workload while keeping our hosting costs reasonable.

But despite all of the advantages of running background jobs, under real world usage you can still run into challenging situations that require thoughtful handling. One general class of problems that can arise is achieving fairness in resource usage across users. 

Read More →

Efficient Pagination in SQL and ElasticSearch

By Josh Silverman on Apr 20, 2016 10:06:37 AM

Many web interfaces let a user effortlessly page through large sets of data. Implementing database queries that fetch these pages is also effortless for the programmer, usually requiring an OFFSET and LIMIT in the case of SQL and a FROM and SIZE in the case of Elasticsearch. Although this method is easy on the user and programmer, pagination queries of this type have a high hidden cost for SQL, Elasticsearch and other database engines.

At Salsify, we encountered this problem when implementing a feature to allow a user to step through records in a large, heavily filtered and sorted set. We had to implement an efficient pagination solution that would work in both our SQL and ES datastores. In this post we’ll look at an interesting technique to make pagination efficient for the database with large datasets. Specifically, we’ll look at implementation in SQL as well as how to translate this method to Elasticsearch.

Read More →

Good Fences: Neighborly Styling with CSS Modules

By Dan Freeman on Feb 24, 2016 9:57:55 AM

Have you ever noticed that no one writes "How we name ourRuby variables at Company X" blog posts? No one's making the Hacker News front page with "I combined these two strategies for method naming and suddenly my JavaScript is maintainable!" And yet when it comes to CSS, developers are all about naming strategies. We sing the praises of BEM, SMACSS, OOCSS, SUIT, or whatever other set of capital letters is popular this week. Why is that?

Read More →

Recent Posts