internal/postgres: enqueue modules in a random order

For requeuing, we were sorting modules first by importance of
reprocessing, then by module path. That clumped all versions of the
same module together (at least all those that needed reprocessing).

The main problem with reprocessing multiple versions of the same
module was that if the load shedder was off about the size, that error
would persist while all versions of the module were being processed.

For example, github.com/elastic/beats/v7 has an 86Mi zip file, but
only a few Ki of actual Go files. So a node with a 100Mi threshold
would shed one version if it was processing another. This frequent
shedding slowed the task queue rate to a crawl.

This CL effectively randomizes the modules that are enqueued,
by hashing the name and version.

This small change has had a dramatic effect on processing.

Workers are now living for many minutes, even hours, before OOMing;
before, even with a load-shed threshold of 100Mi they would still
OOM in a half hour or so.

Before, processing was erratic: bursts of fast progress interspersed
with intervals where processing slowed, as many versions of the same
module were processed together, where the module was large or had
other problems, like the the elastic/beats module discussed above.

Now, processing is uniform and fast, with 8 workers processing about 5
packages/sec and very few sheds.

Change-Id: Id4f2010f7ab3131d4b4e37721cab6d5ff1680a54
Reviewed-on: https://go-review.googlesource.com/c/pkgsite/+/258018
Reviewed-by: Julie Qiu <julie@golang.org>
Trust: Jonathan Amsterdam <jba@google.com>
diff --git a/internal/postgres/requeue.go b/internal/postgres/requeue.go
index 1761908..d4b142f 100644
--- a/internal/postgres/requeue.go
+++ b/internal/postgres/requeue.go
@@ -198,6 +198,14 @@
 	LIMIT $2
 `
 
+// Like the standard query above, this query prioritizes latest versions.
+// However, it does not hold off large modules until the end. In fact, it tries
+// to avoid grouping modules in any way except by latest and status code:
+// processing is much smoother when they are enqueued in random order.
+//
+// To make the result deterministic for testing, we hash the module path and version
+// rather than actually choosing a random number. md5 is built in to postgres and
+// is an adequate hash for this purpose.
 const nextModulesToProcessQueryAlt = `
     -- Make a table of the latest versions of each module.
 	WITH latest_versions AS (
@@ -237,7 +245,6 @@
 			WHEN status = 540 OR status = 541 OR status = 542 THEN 4
 			ELSE 5
 		END,
-		module_path,
-		version -- for reproducibility
+		md5(module_path||version) -- deterministic but effectively random
 	LIMIT $2
 `