Making N+1 query detection actually useful at scale
Why N+1 queries matter
An N+1 query is one of the most common performance anti-patterns in applications backed by a relational database. It happens when code loads a collection of records, then lazily fetches an association for each one individually, turning what should be one query into N+1.
# 1 query to load users
users = User.where(team_id: team.id)
# + N queries, one per user, to load their profile
users.each { |u| puts u.profile.bio }
At scale, where N might be 100, 200, or 1,000, the costs add up:
- Each query pays a round-trip cost to the database. 100 sequential round-trips can easily add hundreds of milliseconds to a request that should render in under 100ms.
- Because each individual query is cheap, none of them shows up in slow query logs. Collectively, though, they monopolise connection pool slots and saturate the database with thousands of tiny queries per second.
- In a threaded server like Puma, each request holds a database connection for the duration of its N+1 loop. If requests queue up waiting for connections, you get cascading latency spikes, a noisy-neighbour style problem.
- N+1s often hide inside serializers, GraphQL resolvers, or view partials where the association access is one line buried in a loop. If your GraphQL query resolves object → object → object, you can get N+1s at each level, compounding the damage.
The fix is usually straightforward: use includes, preload, or eager_load. Finding N+1s in a large codebase is the hard part. That's where automated detection with Prosopite comes in: Prosopite watches for repeated similar queries within a single request and reports them.
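To make the round-trip arithmetic concrete, here is a toy simulation in plain Ruby (no Rails; FakeDB and its query counter are invented purely for illustration) contrasting the lazy per-record fetch with a single batched fetch, which is what includes/preload generate under the hood:

```ruby
# Toy stand-in for a database connection that simply counts queries issued.
class FakeDB
  attr_reader :query_count

  def initialize
    @query_count = 0
  end

  def select(_sql)
    @query_count += 1
  end
end

user_ids = (1..100).to_a

# N+1 pattern: 1 query for the users, then one per user for the profile.
db = FakeDB.new
db.select("SELECT * FROM users WHERE team_id = 1")
user_ids.each { |id| db.select("SELECT * FROM profiles WHERE user_id = #{id}") }
puts db.query_count # 101 queries

# Eager-loading pattern: 1 query for the users, 1 batched query for profiles.
db2 = FakeDB.new
db2.select("SELECT * FROM users WHERE team_id = 1")
db2.select("SELECT * FROM profiles WHERE user_id IN (#{user_ids.join(', ')})")
puts db2.query_count # 2 queries
```

The query count drops from N+1 to a constant 2, regardless of how many users the team has.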
The beginning
We use Prosopite in our Rails app to detect N+1 queries. We shipped a patch that monkey-patched Prosopite.send_notifications to emit a telemetry span for every N+1 query detected in production.
# The original patch — clean and simple
Honeycomb.start_span(name: "n_plus_1_query_report") do |span|
  span.add_field("n_plus_1.query", query)
  span.add_field("n_plus_1.count", count)
end
This gave us visibility, but the data was unusable at scale.
Problem 1 - High-cardinality queries can’t be grouped over time
The raw SQL includes literal values (WHERE id = 1, WHERE id = 2, …), so every query looks unique to the telemetry backend. You can't GROUP BY the query field to answer questions like "which N+1 patterns are the worst?", especially over windows longer than 24 hours, given the sheer volume of data.
Fix 1: Use PgQuery to compute a stable fingerprint for each SQL statement.
PgQuery.fingerprint produces a hash that’s identical regardless of literal values or formatting. I also added PgQuery.normalize to replace literals with $1, $2 placeholders for a more readable version.
fingerprint = PgQuery.fingerprint(query).to_s # e.g. "a1b2c3d4..."
normalized = PgQuery.normalize(query) # e.g. "SELECT ... WHERE id = $1"
span.set_attribute("n_plus_1.query_fingerprint", fingerprint)
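To see what stability across literals buys you, here's a crude stand-in for the idea (illustration only: PgQuery fingerprints the real PostgreSQL parse tree, whereas this sketch just regex-replaces literals before hashing):

```ruby
require "digest"

# Toy fingerprint: replace literal values with a placeholder, collapse
# whitespace, then hash. (PgQuery uses a real parse tree; this regex
# version is only a sketch of the concept.)
def toy_fingerprint(sql)
  normalized = sql
    .gsub(/'[^']*'/, "$?")  # string literals
    .gsub(/\b\d+\b/, "$?")  # numeric literals
    .gsub(/\s+/, " ")
    .strip
  Digest::SHA256.hexdigest(normalized)[0, 12]
end

a = toy_fingerprint("SELECT * FROM users WHERE id = 1")
b = toy_fingerprint("SELECT * FROM users WHERE id = 2")
puts a == b # true: different literals, same fingerprint
```

Both queries collapse to one fingerprint, which is exactly the property that makes GROUP BY over long time windows possible.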
Problem 2 - SQL comments broke the fingerprinting
After deploying the fingerprinting, I noticed queries with the same logical structure producing different fingerprints. This was due to SQL comments injected by tools like Rails’ Marginalia or pg_stat_statements annotations.
SELECT "users".* FROM "users" WHERE "users"."id" = 1 /*application:MyApp,controller:users*/
SELECT "users".* FROM "users" WHERE "users"."id" = 2 /*application:MyApp,controller:users,action:show*/
Fix 2: Parse and deparse the SQL before fingerprinting. PgQuery.parse(sql).deparse round-trips the query through a real PostgreSQL parser which strips comments and normalizes formatting.
clean_query = PgQuery.parse(query).deparse # Comments stripped, formatting normalized
fingerprint = PgQuery.fingerprint(clean_query).to_s
normalized = PgQuery.normalize(clean_query)
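PgQuery.parse(sql).deparse does this with a real parser. As a rough illustration of why comment stripping restores stability (a sketch only, and unlike a real parser it would also mangle comment-like text inside string literals), a regex version looks like this:

```ruby
# Strip /* ... */ comments, then collapse whitespace. Sketch only: a real
# SQL parser also handles comment markers inside string literals correctly.
def strip_sql_comments(sql)
  sql.gsub(%r{/\*.*?\*/}m, "").gsub(/\s+/, " ").strip
end

q1 = %(SELECT "users".* FROM "users" WHERE "users"."id" = 1 /*application:MyApp,controller:users*/)
q2 = %(SELECT "users".* FROM "users" WHERE "users"."id" = 1 /*application:MyApp,controller:users,action:show*/)

# With the Marginalia-style annotations removed, the two strings match
# again, so they fingerprint identically.
puts strip_sql_comments(q1) == strip_sql_comments(q2) # true
```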
I also tweaked two things here:
- n_plus_1.query now shows the normalized form by default.
- Added n_plus_1.query_raw (truncated to 500 chars) for debugging when you need deeper context.
Problem 3 - Missing GraphQL context on N+1 spans
N+1 spans appeared in the telemetry backend without any link to the GraphQL operation that caused them. Prosopite's send_notifications fires at the end of a request, outside the GraphQL trace span hierarchy, so add_attribute_to_trace (which propagates attributes to child spans) didn't help: the N+1 span isn't a child of the GraphQL span.
Fix 3: Created a RequestContext class using ActiveSupport::CurrentAttributes, a thread-safe, request-scoped store that resets between requests. The GraphQL tracer writes the operation name into it and the Prosopite patch reads from it, letting us pass context across decoupled parts of the request lifecycle.
# In the GraphQL tracer:
RequestContext.graphql_operation_name = op_name
# In the Prosopite patch:
graphql_operation_name = RequestContext.graphql_operation_name
span.set_attribute("graphql.operation_name", graphql_operation_name) if graphql_operation_name.present?
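The mechanics of CurrentAttributes can be approximated in plain Ruby with Thread.current (a sketch only; the real class manages many attributes declaratively and resets them automatically around each request via Rails' executor):

```ruby
# Minimal stand-in for ActiveSupport::CurrentAttributes: per-thread storage
# that decoupled parts of the request lifecycle can write to and read from.
class RequestContext
  def self.graphql_operation_name=(name)
    Thread.current[:graphql_operation_name] = name
  end

  def self.graphql_operation_name
    Thread.current[:graphql_operation_name]
  end

  # Rails resets CurrentAttributes between requests; here it is explicit.
  def self.reset
    Thread.current[:graphql_operation_name] = nil
  end
end

RequestContext.graphql_operation_name = "FetchStories"
other = Thread.new { RequestContext.graphql_operation_name }.value

puts RequestContext.graphql_operation_name # "FetchStories"
puts other.inspect # nil: the state is thread-local, so requests don't leak
```

Because the store is keyed per thread, concurrent requests in a threaded server like Puma each see only their own operation name.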
Problem 4 - Too many telemetry spans
Even with fingerprinting, we were creating one span per raw query. An N+1 that fires 200 times in a single request produced 200 spans, all with the same fingerprint, making the N+1 reporting both expensive and noisy.
Prosopite's tally groups identical raw queries, but SELECT ... WHERE id = 1 and SELECT ... WHERE id = 2 are different strings, so tally treats them as separate entries.
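A quick demonstration of the mismatch, using Ruby's Enumerable#tally on raw strings and a simple literal-replacing gsub as a stand-in for PgQuery fingerprinting:

```ruby
queries = [
  "SELECT * FROM users WHERE id = 1",
  "SELECT * FROM users WHERE id = 2",
  "SELECT * FROM users WHERE id = 2",
]

# Grouping raw strings: literal values make near-identical queries distinct.
puts queries.tally.size # 2 entries, even though it's one logical pattern

# Grouping after replacing literals (a crude stand-in for fingerprinting):
puts queries.map { |q| q.gsub(/\b\d+\b/, "$1") }.tally.size # 1 entry
```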
Fix 4: Fingerprint first, then tally. Instead of emitting per query, I collect all queries into a fingerprint_data hash, aggregate counts by fingerprint, and emit one span per unique fingerprint.
fingerprint_data = {}

queries.each do |query|
  clean_query = PgQuery.parse(query).deparse
  fingerprint = PgQuery.fingerprint(clean_query).to_s

  existing = fingerprint_data[fingerprint]
  if existing
    existing[:count] += 1
  else
    fingerprint_data[fingerprint] = { count: 1, raw_query: query, clean_query: clean_query }
  end
end
# Now emit ONE span per unique pattern
fingerprint_data.each do |fingerprint, data|
  Telemetry::TinesTracer.in_span("n_plus_1_query_report") do |span|
    span.set_attribute("n_plus_1.count", data[:count])
    span.set_attribute("n_plus_1.query_fingerprint", fingerprint)
    # ...
  end
end
Problem 5: Who owns this N+1?
We could now see the N+1 patterns and which GraphQL operations triggered them. As we scale, teams want to be alerted about the N+1s in code they own, both to improve user experience and to reduce load on the infrastructure.
Fix 5: Pull the code_owner tag from Sentry's scope, which our request middleware already sets based on code ownership, and attach it to the N+1 span.
code_owner = Sentry.get_current_scope.tags[:owner]
span.set_attribute("code_owner", code_owner) if code_owner.present?
rails n+1-queries performance postgresql observability graphql software-engineering
2026-01-01 12:00 (Last updated: 2026-02-08 21:37)