how we scaled janitor from node to elixir and cloudflare workers
hey everyone,
so i wanted to talk about something i've been grinding on for the past few months that nobody really sees, but it's the reason janitor hasn't caught fire despite handling billions of messages.
the problem with node
i remember one night around 2am we had 500k concurrent websocket connections and our nodejs cluster was using something like 120gb of ram across 30 servers. for reference, whatsapp famously handled 2 million connections per server on erlang (the same vm elixir runs on). we were literally doing it wrong lol.
enter elixir
so i made the call to rewrite everything in elixir and phoenix.
the migration was brutal ngl. i basically had to run two systems in parallel for months. but it was worth it.
what used to take 30 nodejs servers now runs on 3 elixir nodes. websocket connections? we can handle millions without breaking a sweat. each connection is its own isolated process, so if one crashes the supervisor just restarts it without affecting anything else. no more "oops the whole server died because someone sent a weird emoji" lmao
def handle_in("message:create", %{"content" => content}, socket) do
  broadcast!(socket, "new_message", %{content: content})
  {:noreply, socket}
end
that is how simple real time messaging is in phoenix.
but wait there is more - cloudflare workers
elixir solved my real-time problems, but i still had all these stateless apis that didn't need the beam vm overhead: character search, fetching bot definitions, user profiles, basic crud stuff that i just want to be fast.
enter cloudflare workers.
moved those endpoints to workers running on the edge, closer to users. honestly should've done this from the start.
here's roughly what a worker looks like:
export default {
  async fetch(request, env) {
    // the id comes off the request url
    const characterId = new URL(request.url).searchParams.get("id");
    // check the kv cache first
    const cached = await env.KV.get(characterId);
    if (cached) return new Response(cached);
    // cache miss - hit the database
    const data = await env.DB.prepare("SELECT * FROM characters WHERE id = ?")
      .bind(characterId)
      .first();
    if (!data) return new Response("not found", { status: 404 });
    // cache for next time
    await env.KV.put(characterId, JSON.stringify(data));
    return Response.json(data);
  },
};
the database migration hell
ok before i talk about the architecture, let me tell you about the database situation. i had over 20 billion rows in my chats table, and another 100 billion in another table. all partitioned tables with hundreds of chunks.
migrating this much data is insane. here's how i did it:
i wrote a go script using pgx to handle the parallel exports. honestly go is goated for this type of stuff.
func migrateBatch(pool *pgxpool.Pool, startId, endId int64) {
    // export batch to csv
    // compress with gzip
    // import to new table
    // verify counts match
    // drop from old table
}

// run 50 parallel workers and wait for them all to finish
var wg sync.WaitGroup
for i := int64(0); i < 50; i++ {
    wg.Add(1)
    go func(i int64) {
        defer wg.Done()
        migrateBatch(pool, i*batchSize, (i+1)*batchSize)
    }(i)
}
wg.Wait()
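the export/compress steps inside migrateBatch are mostly stdlib plumbing. here's a minimal sketch of the csv + gzip round trip (the row data is made up for illustration; in the real script the rows come out of pgx):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/csv"
	"fmt"
)

// compressBatch writes rows out as csv and gzips the result,
// mirroring the "export batch to csv, compress with gzip" steps.
func compressBatch(rows [][]string) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	w := csv.NewWriter(gz)
	if err := w.WriteAll(rows); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressBatch reverses it, so the import side can re-read
// the batch and verify the row counts match.
func decompressBatch(data []byte) ([][]string, error) {
	gz, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return csv.NewReader(gz).ReadAll()
}

func main() {
	rows := [][]string{{"1", "hello"}, {"2", "world"}}
	packed, _ := compressBatch(rows)
	unpacked, _ := decompressBatch(packed)
	fmt.Println(len(unpacked) == len(rows)) // counts match
}
```

gzip mattered here because the batches travel between boxes; csv because postgres COPY speaks it natively on both ends.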
took 3 days of weird uptime (probably saw a few queue pages if you know you know lol) but i got it done. some lessons:
- do not use cdc on billion-row tables - the wal retention just doesn't work
- batch everything - 50m rows at a time worked for me
- verify as you go - i had checksums for every batch because paranoia
- have a rollback plan - i kept backups of everything (obviously)
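for the checksum bullet, the idea is an order-independent fingerprint per batch that you compute against both the old and new table and compare. a rough sketch (the row struct and fields here are hypothetical; real rows came out of postgres):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hypothetical row shape for illustration
type row struct {
	id      int64
	content string
}

// batchChecksum builds an order-independent fingerprint of a batch:
// sort by id, then hash every row, so source and target tables can
// be compared without trusting row order.
func batchChecksum(rows []row) string {
	sorted := append([]row(nil), rows...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].id < sorted[j].id })
	h := sha256.New()
	for _, r := range sorted {
		fmt.Fprintf(h, "%d|%s\n", r.id, r.content)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	src := []row{{1, "hi"}, {2, "yo"}}
	dst := []row{{2, "yo"}, {1, "hi"}} // same rows, different order
	fmt.Println(batchChecksum(src) == batchChecksum(dst)) // true
}
```

if the two checksums disagree you stop before dropping anything from the old table, which is the whole point of the paranoia.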
the architecture now
the migration took 4 months of basically no sleep. i had to:
- rewrite 100+ api endpoints
- migrate billions of database rows without downtime (mostly lol)
- partition tables while keeping them online
- keep both systems running in parallel with gradual rollout
- not break anything for millions of users
was it worth it?
well kinda yea. our infrastructure costs dropped a bit. response times improved a bit. we can actually sleep at night knowing the site will not randomly die, for now. keyword = for now, i still 100% expect us to hit scaling challenges, but they will be much easier to fix with how well this infra is working. and honestly it's so much more fun making real progress on these issues. just glad to be working on something this fun!
and the real win? we can finally build the features we've been promising. real-time features with hundreds of people. analytics. instant notifications for everything. a/b testing. all the stuff that was impossible when we were fighting memory leaks at 3am…
what is next
the lesson here? don't be afraid to throw everything out and start over if your architecture is holding you back. i wasted a year trying to make nodejs scale. should have switched to elixir sooner, and definitely should have used the cloudflare stack from day one.
if you are curious about the technical details, or want to know more about how i did the migration without ~much downtime, let me know. always happy to share what i learned the hard way.
thanks for reading and thanks for sticking with janitor through all the growing pains.
shep