Resolving a performance issue in a Rails platform. cancancan, New relic, N+1 

(Everyday Code – instead of keeping our knowledge in a README.md let’s share it with the internet)

Today’s article is about a performance issue that we had in one of our Rails platforms and how we resolved it. I enter into details of the specific root cause and how New relic was helpful in identifying the issue. This article is mainly targeted for our team, but I hope it would be useful for the community to see a practical example of cancancan and N+1.

The root cause

Every time we show a lesson on the FLLCasts platform we show the tutorials and materials of that lesson. But if the user does not have access to a tutorial, because it is not published or it requires a teachers subscription we do not show it in the lesson. We filter it from the lesson.

We are using cancancan and we’ve added code for checking if every tutorial in the course could be accessed – whether you have the right subscription. This was not a problem on its own. The major problem was that we were showing a navigation of the course on the right that contains links to all the lessons with all the tutorials:

FLLCasts Course nagivation showing links to all the tutorials
Show of links to all the tutorials in the whole course in the navigation

To show the navigation we looped through all the tutorials in the whole course and checked if the user has access to them. It was an N+1. This seems to be working when a course contains < 100 tutorials. But once the courses grew and contained more tutorials the logic got slow.

People started noticing. We had to increase the number of Heroku dynos from time to time. Especially when a lot of students join and use the platform.

There were two problems

  1. We don’t need to show all the tutorials and building instructions in the navigation. We just need to show them for the current lesson.
  2. We need to make a singe query to the DB that would get only the tutorials in that lesson and only the tutorials the user has access to.

What was the performance impact

During load when a lot of students were using the platform the navigation/_course_section partial was taking 41% of the show lesson request average of 5 seconds. There were about 100 find queries for a single lesson.

Table with how expensive it was to show the navigation for large courses

It was taking about 10-15 seconds for the server to respond and sometimes requests were failing.

Chart showing slow responses
Responses were slow. > 15 second sometimes.

On the chart below we see how the moment people start opening courses the time for showing lessons (course_sections) starts growing.

Chart showing sharp increase
Sharp increase in time the moment we make requests

Practically the platform was unusable in this case. Everything stopped.

Solutions

Change the navigation to include only the current lesson

We decided that we don’t need to show all the tutorials from the whole course in the navigation, only the tutorials for the current lessons. This reduced the load because instead of returning 140 tutorials we were returning 20.

Improve the call to cancancan

One thing that we’ve missed is a call to cancancan that is can? :show for each record. It has slipped in the years and we haven’t noticed.

records.each do |record|
   if can? :show, record
     ...
   end
end

I refactored it to use cancancan accessible_by method.

@content_refs = @course_section
        .content_refs
        .accessible_by(current_ability)
        .order(:position)

and added two new :index rules in the ContentRefs ability. The benefit is that in this way cancancan makes a single query to the db.

module Abilities
  class ContentRefsAbility
    include CanCan::Ability

    def initialize user
      can :index, ContentRef, { archived: [nil,false], hidden: false}
      can :index, ContentRef, { archived: true, hidden: [true,false], course_section: { authors: { user_id: user.id }}}
      ...
      can :read, ContentRef, { archived: [nil,false], hidden: true ,course_section: {course: {required_access_for_hidden_refs: {weight: 0..user_max_plan_weight}}}}
      ...
    end
  end
end

The third rule is the deal breaker. With it we check if the user has a subscription plan that would allow them to access the content ref.

With these three rules we allow everybody to see non archive and non hidden content_refs, we also allow authors of the course_section to see archived and hidden content_refs where they are the author of the course_section and we allow allow everybody to see hidden content_refs if they have the proper subscription plan.

Results

Number of timeouts reduced to from 180 to 6

jenkins@elvis:/var/log/fllcasts$ grep "Rack::Timeout::RequestTimeoutException" access.log | wc -l
6
jenkins@elvis:/var/log/fllcasts$ zgrep "Rack::Timeout::RequestTimeoutException" access.log-20210320.gz | wc -l
183
jenkins@elvis:/var/log/fllcasts$ zgrep "Rack::Timeout::RequestTimeoutException" access.log-20210319.gz | wc -l
164

Increase in throughout put does not increase the required time

It does not matter whether we have 1 or 40 rpm for this controller. The time stays the same. Next we can focus on how to make it even faster, but this will do until the platform grows x10