[ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Matthieu Libeer
Hello :)

tldr: For efficiency, allow to split an ActiveRecord relation in batches by giving the bounds of the batches instead of an enumerator of relations.
This makes it possible, for instance, to split a table in chunks processed by parallel jobs.

Rationale: Doing some heavy lifting on a 1 000 000 rows table requires to use batches.
Indeed, loading all of them at once will trigger a memory overflow.
For that, ActiveRecord already offers `in_batches`, `find_in_batches` and `find_each`.
However, if the processing of one batch of 1000 records takes 5 seconds, processing the whole table will
take 1000 x 5 seconds ~= 1h30, which is not an acceptable duration for a job.
What you can do instead is to split the table in 1000 batches that will be processed in parallel
via a processing queue - for example.
To do so, you need to determine how to split the table into batches. You could do it by using `find_in_batches`
then enqueue jobs that take a list of ids as argument. A more efficient method is to determine the boundaries
of the batches. For instance, if the ids follow an incremental sequence between 1 to 1 000 000,
the boundaries are 1, 1001, 2001, ..., 999 991, 1 000 000.
Then a batch is determined by its lower and upper bound.

Note: I already implemented a working version for my company's codebase.
It's heavily needed for us, as Heroku shuts down workers after a 30 seconds grace period.
We also want to avoid table-level locks while performing bulk updates. Using batches also helps.

Example:
```
class MyJobLauncher < ApplicationJob
  def perform
    User.some_scope.each_batch_bounds(batch_size: 10000) do |lower_bound, upper_bound|
      MyJob.perform_later(lower_bound, upper_bound)
    end
  end
end

class MyJob < ApplicationJob
  def perform(lower_bound, upper_bound)
    do_some_heavy_lifting_with User.some_scope.batch_from_bounds(lower_bound, upper_bound)
  end
end
```

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: [ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Matthieu Libeer
Here are some benchmarking results to fuel the discussion. Tested on a production table with 5 million rows:
1. Execution time: Divided by 5 (if you hijack the yielded relation ...), divided by 50 to 100 if you use the yielded relation the extract the batch bounds
2. Network & Memory: Divided by ~50

On Wednesday, September 5, 2018 at 2:13:30 AM UTC+2, Matthieu Libeer wrote:
Hello :)

tldr: For efficiency, allow to split an ActiveRecord relation in batches by giving the bounds of the batches instead of an enumerator of relations.
This makes it possible, for instance, to split a table in chunks processed by parallel jobs.

Rationale: Doing some heavy lifting on a 1 000 000 rows table requires to use batches.
Indeed, loading all of them at once will trigger a memory overflow.
For that, ActiveRecord already offers `in_batches`, `find_in_batches` and `find_each`.
However, if the processing of one batch of 1000 records takes 5 seconds, processing the whole table will
take 1000 x 5 seconds ~= 1h30, which is not an acceptable duration for a job.
What you can do instead is to split the table in 1000 batches that will be processed in parallel
via a processing queue - for example.
To do so, you need to determine how to split the table into batches. You could do it by using `find_in_batches`
then enqueue jobs that take a list of ids as argument. A more efficient method is to determine the boundaries
of the batches. For instance, if the ids follow an incremental sequence between 1 to 1 000 000,
the boundaries are 1, 1001, 2001, ..., 999 991, 1 000 000.
Then a batch is determined by its lower and upper bound.

Note: I already implemented a working version for my company's codebase.
It's heavily needed for us, as Heroku shuts down workers after a 30 seconds grace period.
We also want to avoid table-level locks while performing bulk updates. Using batches also helps.

Example:
```
class MyJobLauncher < ApplicationJob
  def perform
    User.some_scope.each_batch_bounds(batch_size: 10000) do |lower_bound, upper_bound|
      MyJob.perform_later(lower_bound, upper_bound)
    end
  end
end

class MyJob < ApplicationJob
  def perform(lower_bound, upper_bound)
    do_some_heavy_lifting_with User.some_scope.batch_from_bounds(lower_bound, upper_bound)
  end
end
```

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: [ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Greg Navis
In reply to this post by Matthieu Libeer
I'm a bit confused with your remark about the 30 second grace period. There's a 30 second timeout on web dynos but worker dynos can work up to 24 hours + jitter (that's when Heroku's mandatory dyno cycling kicks in). Could you clarify what you mean by that?

You may run into issues if you run the Heroku Scheduler because before launching the job it terminates its previous instance if it's still working (so if you have a job that runs every 10 minutes you have 10 minutes to finish and so on).

Best regards
Greg Navis

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: [ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Matthieu Libeer
When shutting down, restarting, or deploying a new version of an app, all processes have 30 seconds to finish, including jobs. If a worker is stuck in a job that takes more than that, the process will receive a SIGKILL. Note that this is not the only reason why we want to do the heavy lifting in batches, locks and memory consumption are other examples.

On Wednesday, September 5, 2018 at 11:58:53 AM UTC+2, Greg Navis wrote:
I'm a bit confused with your remark about the 30 second grace period. There's a 30 second timeout on web dynos but worker dynos can work up to 24 hours + jitter (that's when Heroku's mandatory dyno cycling kicks in). Could you clarify what you mean by that?

You may run into issues if you run the Heroku Scheduler because before launching the job it terminates its previous instance if it's still working (so if you have a job that runs every 10 minutes you have 10 minutes to finish and so on).

Best regards
Greg Navis

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: [ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Greg Navis
Thanks for the explanation. That makes sense.

Your use cases is certainly valid but I'm on the fence whether this should be in Active Record. It should be possible to extract boundaries using #in_batches and #where_values_hash:

relation.in_batches do |relation|
  min_id, max_id = relation.where_values_hash['id'].minmax
  # Do something with the boundaries.
end

I'm not sure whether that warrants a separate method.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.
Reply | Threaded
Open this post in threaded view
|

Re: [ActiveRecord] Feature proposal: `ActiveRecord::Batches#each_batch_bounds`

Matthieu Libeer
This is exactly the approach I first had :)
`in_batches` plucks the ids by batch and then builds the yielded relation by adding a where clause on those ids. With 1 millions rows, that's 1 million ids plucked. When plucking bounds only, say for batches of size 10 000, that's 200 bounds only (1 - 10000, 10001 - 20000, ..., 990 001, 1 000 000). On 5 million rows, it takes ~ 30 seconds to build the relations, ~ 5 seconds to build the bounds. I guess we could modify `in_batches` to use bounds instead, by yielding relations that apply a condition `where primary key between this and that` instead of `where primary key in those ids`. This is however a bit more prone to racing conditions, but they are inherent to batching, as the current document explains.

On Wednesday, September 5, 2018 at 2:40:51 PM UTC+2, Greg Navis wrote:
Thanks for the explanation. That makes sense.

Your use cases is certainly valid but I'm on the fence whether this should be in Active Record. It should be possible to extract boundaries using #in_batches and #where_values_hash:

relation.in_batches do |relation|
  min_id, max_id = relation.where_values_hash['id'].minmax
  # Do something with the boundaries.
end

I'm not sure whether that warrants a separate method.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [hidden email].
To post to this group, send email to [hidden email].
Visit this group at https://groups.google.com/group/rubyonrails-core.
For more options, visit https://groups.google.com/d/optout.