Chasing 500 Errors in Production — A Developer's Field Notes

Nothing prepares you for a production bug quite like the moment a stakeholder sends you a screenshot of a white screen with "Internal Server Error" and the question "is this supposed to happen?" It happened to me three times in the first two months of the GeriCare DNB Portal being live. Each one was different. Each one taught me something.

These are my actual field notes — the specific bugs, how I found them, and how I fixed them.

Bug 1: The silent database connection pool exhaustion

Symptoms: The app worked fine for hours, then all API calls started returning 500s simultaneously. Restarting the Cloud Run service fixed it temporarily. The logs showed nothing obvious.

The actual cause: my PostgreSQL connection pool was set to max: 20. The Cloud Run service was running 3 instances at peak load. That's potentially 60 connections against a Neon DB free-tier database with a 20-connection limit. The pool didn't throw an error — it just timed out silently waiting for a connection that would never become available.

// ❌ Before — too many connections
const pool = new Pool({ connectionString, max: 20 });

// ✅ After — accounts for horizontal scaling
const MAX_INSTANCES = 3;  // Cloud Run max-instances setting
const DB_LIMIT = 20;        // Neon free tier

const pool = new Pool({
  connectionString,
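  // floor(20 / 3) - 1 = 5 per instance, so at most 15 connections in total,
  // which leaves headroom below the 20-connection limit
  max: Math.floor(DB_LIMIT / MAX_INSTANCES) - 1,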
  connectionTimeoutMillis: 5000,
  idleTimeoutMillis: 30000
});

When you deploy to a platform that scales horizontally, you need to think about total connections across all instances, not just per-instance pool size.
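The sizing rule can also be written as a small helper, so the pool size is derived from the two limits rather than hard-coded. This is a minimal sketch: the names perInstancePoolSize and worstCaseConnections, and the reserved parameter (connections deliberately held back), are illustrative assumptions and not part of the original code.

```javascript
// Hypothetical helper: derive a per-instance pool size from the database's
// connection limit and the maximum number of app instances.
function perInstancePoolSize(dbLimit, maxInstances, reserved = 0) {
  if (!Number.isInteger(dbLimit) || !Number.isInteger(maxInstances) || maxInstances < 1) {
    throw new RangeError('dbLimit and maxInstances must be integers, with maxInstances >= 1');
  }
  // Split what is left after the reserved connections evenly across instances.
  const size = Math.floor((dbLimit - reserved) / maxInstances);
  if (size < 1) {
    throw new RangeError('Not enough connections for every instance to hold one');
  }
  return size;
}

// Worst-case total if every instance fills its pool at the same time.
function worstCaseConnections(poolMax, maxInstances) {
  return poolMax * maxInstances;
}
```

With the numbers from this bug, worstCaseConnections(20, 3) is 60, three times the database limit. perInstancePoolSize(20, 3, 5) returns 5, and worstCaseConnections(5, 3) is 15, which stays under the limit.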

Bug 2: The wrong column name in a dynamic query

Symptoms: One specific API endpoint returned 500 in production but worked perfectly in development. The error in logs was column "traineeId" does not exist.

The cause: PostgreSQL folds unquoted identifiers to lowercase. My development database had been created with a migration that used "traineeId" (quoted, preserving case). The production database had been created from a script that used traineeId without quotes, so Postgres stored the column as traineeid. My query used the quoted form "traineeId", which matched the column on dev but failed on prod, where only traineeid exists.

-- ❌ Inconsistent — relies on how the column was created
SELECT * FROM logbook_entries WHERE "traineeId" = $1;

-- ✅ Consistent — use snake_case everywhere, no quoting needed
SELECT * FROM logbook_entries WHERE trainee_id = $1;

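Use snake_case for all PostgreSQL identifiers, and rename any existing mixed-case columns in a migration so that every environment ends up with the same schema. It avoids the quoting mess entirely and is the Postgres convention anyway.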
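To make the folding rule concrete, here is a small sketch that imitates how an identifier in a query is matched against stored column names. It is a simplified model for illustration only (ASCII names, no schema qualification), and the function names resolveIdentifier and columnExists are hypothetical.

```javascript
// Simplified model of PostgreSQL identifier resolution:
// quoted identifiers keep their case, unquoted ones are folded to lowercase.
function resolveIdentifier(identifier) {
  const quoted = /^"(.*)"$/.exec(identifier);
  if (quoted) {
    // Inside quotes, a doubled quote ("") stands for one literal quote character.
    return quoted[1].replace(/""/g, '"');
  }
  return identifier.toLowerCase();
}

// True if the identifier, as written in a query, matches a stored column name.
function columnExists(identifier, storedColumns) {
  return storedColumns.includes(resolveIdentifier(identifier));
}

// Column names as each database stored them.
const devColumns = ['id', 'traineeId'];  // created with "traineeId" (quoted)
const prodColumns = ['id', 'traineeid']; // created with traineeId (unquoted)
```

In this model, columnExists('"traineeId"', devColumns) is true and columnExists('"traineeId"', prodColumns) is false, which is the dev/prod split described above. A snake_case name such as trainee_id resolves to the same string whether or not it is quoted.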

Bug 3: Unhandled promise rejection bringing down the server

Symptoms: The entire Cloud Run instance crashed and restarted. Users mid-session got disconnected. The Cloud Run logs showed UnhandledPromiseRejectionWarning followed by a process exit.

The cause: a background job that ran every hour to generate a report was making a database query. When Neon's serverless database was in its "sleep" state, the connection took longer than the default timeout and the promise rejected. There was no try/catch around it, so the rejection went unhandled. On Node.js 15 and later, an unhandled rejection terminates the process by default.

// ❌ Before — no error handling
setInterval(async () => {
  const report = await generateReport(); // could reject
  await saveReport(report);
}, 60 * 60 * 1000);

// ✅ After — always wrap async work
setInterval(async () => {
  try {
    const report = await generateReport();
    await saveReport(report);
  } catch (err) {
    // Log the full error (message and stack) and continue; don't crash the process
    console.error('Report generation failed:', err);
  }
}, 60 * 60 * 1000);

Every async function that runs outside of an Express request handler needs its own try/catch. Express catches errors thrown synchronously in route handlers (and, from Express 5, rejected promises returned by async handlers). It does not catch errors in background timers, event listeners, or spawned processes.
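One way to make this a habit is to wrap every background task in a helper that cannot reject. The sketch below is one possible shape, not the portal's actual code: the name guarded and the injectable log parameter are assumptions made for illustration.

```javascript
// Hypothetical wrapper: returns a function that runs `task` and never rejects.
// Failures are logged and swallowed so a timer callback cannot crash the process.
function guarded(name, task, log = console.error) {
  return async (...args) => {
    try {
      return await task(...args);
    } catch (err) {
      // Pass the whole error object so the stack trace is kept.
      log(`[${name}] failed:`, err);
      return undefined;
    }
  };
}

// Usage with the hourly report job (generateReport and saveReport are the
// functions from the example above):
// setInterval(guarded('hourly-report', async () => {
//   const report = await generateReport();
//   await saveReport(report);
// }), 60 * 60 * 1000);
```

The wrapper keeps the error handling in one place, so a new background job gets the same protection without anyone having to remember the try/catch.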

The debugging process that works

Across all three bugs, the process was the same:

1. Get the exact error message: not "it's broken", but the precise string from the logs.
2. Reproduce it in a controlled way: can you make it fail on demand?
3. Narrow the scope: is it this endpoint? This user? This data? This time of day?
4. Form one hypothesis at a time: change one thing, observe the result.
5. Fix the root cause, not the symptom: restarting the service fixed bug 1 temporarily, but the fix was reducing pool size.
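
For the first step in the list above, it helps if the exact error string and its context land in the logs as a single searchable entry. Here is a minimal sketch of a formatter that emits one JSON line per error. The name formatErrorLog and the context fields are illustrative assumptions. Cloud Run can parse JSON lines written to stdout or stderr as structured log entries with severity and message fields, but verify that against the current Cloud Logging documentation before relying on it.

```javascript
// Hypothetical helper: serialize an error plus some context as one JSON line.
function formatErrorLog(err, context = {}) {
  // Accept non-Error values (for example a rejected string) as well.
  const error = err instanceof Error ? err : new Error(String(err));
  return JSON.stringify({
    ...context,              // caller-supplied fields, e.g. { route, job }
    severity: 'ERROR',
    message: error.message,  // the exact string to search for later
    stack: error.stack,
  });
}

// Example use inside a catch block (the context values are placeholders):
// console.error(formatErrorLog(err, { job: 'hourly-report' }));
```

Because the context is spread first, the severity, message, and stack fields always come from the error itself and cannot be overwritten by caller-supplied keys.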

Production bugs are uncomfortable, but they're also the best teachers. Each one of these pushed me to understand Node.js, PostgreSQL, and GCP at a level that reading documentation never would have. The discipline of always wrapping async background work in try/catch is now muscle memory, because I know exactly what happens when you don't.