At gap intelligence we’re about that data…to be specific, we’re all about that GFD (Great Freaking Data)! We have a handful of applications that use GFD, and that data is extremely important to our business. With all of this valuable data sitting in our database, why wouldn't we build a way to plug into it, right? Not just for external applications, but for our own in-house applications as well. We could even offer it to our customers. And so the gap API was born.
Just to give a little background before we jump into the code: our application is written in Ruby, using the Rails framework. For the first iteration of the API, we of course fetched data using Rails’ awesome Active Record ORM. I’m a big fan of Active Record (AR); it makes it extremely easy to work with your data models. That ease comes with a performance price, though. So in this blog I will show you how we are slowly but surely removing AR as a dependency in our API, and how we are now leveraging Elasticsearch (ES) to speed up querying our data.
Below is the controller logic for our pricing index before we converted it to use ES.
```ruby
def index
  @pricings = Pricing
  filtering_params(params).each do |key, value|
    @pricings = @pricings.public_send("by_#{key}", value) if value.present?
  end
  @pricings = @pricings.includes(:merchant).order(order).page(page).per(per_page)
  render json: @pricings, serializer: ::V1::PaginationSerializer
end

private

def filtering_params(params)
  params.slice(:category_name, :part_number)
end
```
This is a pretty straightforward approach. We loop through the allowed filtering parameters passed into the index action. For this to work we must have model methods defined, such as by_category_name and by_part_number. In our case we made them model scopes, so they can be chained together if multiple parameters get passed in. Below you can see these scopes defined in our pricing model.
We are also utilizing pagination to limit the amount of data the API returns in the JSON response. Lastly, the index action renders the JSON using a serializer. We use the Active Model Serializers (AMS) gem; one of the reasons we chose AMS is to take advantage of its JSON API adapter, which lets us easily format our JSON according to the JSON API specification (jsonapi.org/format). In the controller code we go through a PaginationSerializer, which is shared by all serializers and dynamically figures out which model serializer to use based on the model. In this case, that's the PricingSerializer you see below.
Pricing Model
```ruby
class Pricing < ActiveRecord::Base
  scope :by_category_name, -> (categories) {
    joins(:category).where("LOWER(categories.name) IN (?)", categories.downcase.split(','))
  }
  scope :by_part_number, -> (part_numbers) {
    joins(:product).where("LOWER(products.part_number) IN (?)", part_numbers.downcase.split(','))
  }
  # ...
end
```
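As a quick illustration of how these scopes chain together, here is roughly the relation the index action builds when both filters are present (the request and parameter values are hypothetical, just to show the shape):

```ruby
# Hypothetical request: GET /v1/pricings?category_name=TVs&part_number=abc-123
Pricing.by_category_name("TVs")      # joins categories, filters on the category name
       .by_part_number("abc-123")    # joins products, filters on the part number
       .includes(:merchant)
       .page(1)
       .per(25)
```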
Pricing Serializer
```ruby
class V1::PricingSerializer < V1::BaseSerializer
  attributes :id
  attributes :date_collected
  attributes :shelf_price
  attributes :in_stock
  attributes :merchant
  attributes :product

  def merchant
    object.merchant.name
  end

  def product
    object.product.name
  end
end
```
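The post doesn't show how the JSON API adapter gets enabled, but with AMS 0.10 it is typically switched on in an initializer along these lines (a minimal sketch, assuming a standard AMS setup):

```ruby
# config/initializers/active_model_serializers.rb
# Tell AMS to render responses in the JSON API format (jsonapi.org/format)
ActiveModelSerializers.config.adapter = :json_api
```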
So our API works and returns the data we need. Awesome! However, our pricing model has millions of records in it, which means queries against the database can be quite slow, especially when paginating deep into the data set; that's where you see some really slow response times. Performance is extremely important to any API: you want it to return data quickly. With that in mind, we decided to give ES a shot.
We decided to use a gem called Searchkick, which runs on top of ES and makes it simple to search your indexed data in a Rails-like fashion.
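Getting Searchkick into the app is straightforward. The post doesn't show that wiring, but a minimal setup sketch looks like this (in our app this presumably sits inside the Searchable concern that shows up later, rather than directly in the model):

```ruby
# Gemfile
gem "searchkick"

# app/models/pricing.rb (minimal sketch; our app wraps this in a concern)
class Pricing < ActiveRecord::Base
  searchkick  # registers the model with Searchkick and backs it with an ES index
end
```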
Let's Get Started!
First things first, we need to decide what data we are going to store in the search index. In Searchkick you do this by implementing the search_data method in your model. For example, in the pricing model:
```ruby
def search_data
  {
    category_name: category.name.downcase,
    part_number: part_number
  }
end
```
After you have defined what data gets indexed, you simply call Pricing.reindex to index your data. Keep in mind that every time you change the search_data method you must reindex the data. One way to speed up indexing is to eager load your associations; in Searchkick you can define the search_import scope like this:
```ruby
scope :search_import, -> { includes(:merchant, :category, :product) }
```
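A quick way to sanity-check what will be stored is to call search_data on a record in the console before kicking off a reindex (the values returned below are just an example):

```ruby
# Rails console sketch; the values shown are illustrative only
pricing = Pricing.first
pricing.search_data
# => { category_name: "tvs", part_number: "abc-123" }

Pricing.reindex  # rebuilds the Elasticsearch index for all pricing records
```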
OK, we have our data indexed. Now how do we query ES to get the data we need? Searchkick comes with some nice and easy ways of searching the data, but as your searches become more advanced, it's actually recommended to use the Elasticsearch DSL, which Searchkick conveniently supports in full. So with that, let's get into the controller logic.
```ruby
def index
  @pricings = Pricing.search(
    include: [:category, :merchant, :product],
    query: query,
    order: order,
    page: page,
    per_page: per_page
  )
  render json: @pricings, serializer: ::V1::PaginationSerializer
end

private

def build_query
  query_array = []
  filtering_params.each do |key, value|
    if value.present?
      query_array << Pricing.public_send("search_by_#{key}", value)
    end
  end
  query_array
end

def filtering_params
  params.slice(:part_number, :category_name)
end

def query
  { bool: { must: build_query } }
end
```
As you can see, there is some familiar code here. We still have the filtering_params method, and we call slightly different methods as we loop through the parameters. These methods are defined in a model concern that gets included into the model. For example:
Model
```ruby
class Pricing < ActiveRecord::Base
  include PricingSearchable
end
```
Model Concern
```ruby
module PricingSearchable
  extend ActiveSupport::Concern
  include Searchable

  included do
    after_save :reindex

    scope :search_import, -> { includes(:merchant, :category, :product) }

    def search_data
      {
        category_name: category.name.downcase,
        part_number: part_number
      }
    end
  end

  module ClassMethods
    def search_by_category_name(category_names)
      { terms: { category_name: category_names.downcase.split(',') } }
    end

    def search_by_part_number(part_numbers)
      { terms: { part_number: part_numbers.downcase.split(',') } }
    end
  end
end
```
I like that all the ES logic is in a concern now, which keeps the pricing model cleaner.
As for building the ES query, we are using what is called a Bool Query: a query that matches documents matching boolean combinations of other queries. Each boolean clause has a typed occurrence, in this case must, which means all of those query clauses must appear in matching documents. So, for example, if I search by a category of 'TVs' and a part number of 123, only documents that match both conditions are returned in the query results. For more information on the query DSL, visit the ES documentation.
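To make that concrete, here is roughly what the query method above would hand to ES for that example, assuming category_name=TVs and part_number=123 were passed in (the values are made up):

```ruby
# Roughly what `query` returns when both filters are present
{
  bool: {
    must: [
      { terms: { category_name: ["tvs"] } },  # from search_by_category_name("TVs")
      { terms: { part_number: ["123"] } }     # from search_by_part_number("123")
    ]
  }
}
```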
The rest of the parameters we pass to the search method are pretty straightforward. The include parameter handles eager loading of related models, and the order parameter sorts the result set; keep in mind that any field you want to sort by needs to be added to the ES index. Lastly, we have the pagination parameters page and per_page.
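For example, if we wanted to sort results by date_collected, that field would first have to be part of search_data, and the order option could then reference it. A hypothetical sketch (the actual order method isn't shown in this post):

```ruby
# In the model concern: index the field you want to sort on
def search_data
  {
    category_name: category.name.downcase,
    part_number: part_number,
    date_collected: date_collected  # now available for sorting in ES
  }
end

# In the controller: one possible order method passed to Pricing.search
def order
  { date_collected: :desc }
end
```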
And Now for the Results!
For our performance testing we wanted to see how well the API performed when paginating deep into a large result set. I wrote a script that hit a specific page in the result set and averaged the response time over 3 requests. The table below shows how many milliseconds each request took, and it's obvious how much the API improved when leveraging ES. Seeing these performance gains was a huge win for us.
| Pagination Page | API w/ ActiveRecord | API w/ Elasticsearch |
| --- | --- | --- |
| 100 | 895ms | 110ms |
| 1,000 | 1425ms | 221ms |
| 10,000 | 1899ms | 301ms |
| 100,000 | 5106ms | 2810ms |
| 200,000 | 6620ms | 3683ms |
| 300,000 | 8282ms | 3497ms |
| 400,000 | 10202ms | 3525ms |
| 500,000 | 12515ms | 4046ms |
Although we've made the API faster, we feel there is still room for improvement. For example, this ES implementation still uses our Active Model Serializers (AMS), so there are still calls to AR in our codebase; there is still work to do in the current version of the API. Our plan is to eventually eliminate Active Record from the API entirely. To do so, we'll need to put more data into the ES index and use it almost as another persistence layer, but only for the API. If and when we go this route, we will have to format the ES response into the JSON API format ourselves, which we currently get out of the box with AMS. Outside of that, not much code would have to change, and it should get even faster!
That’s a wrap folks. Questions and feedback are always welcome, so don’t hesitate to contact me at ecorreia@gapintelligence.com. I’m always looking for better ways to improve our codebase.