<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Behind Eshopbox]]></title><description><![CDATA[Insights from the Eshopbox team on how we build, scale, and automate commerce—covering engineering, infrastructure, AI, and product decisions.]]></description><link>https://behind.eshopbox.com</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1756547053979/850b4c22-974f-4468-bbe4-13cde733777b.png</url><title>Behind Eshopbox</title><link>https://behind.eshopbox.com</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 25 Apr 2026 01:45:51 GMT</lastBuildDate><atom:link href="https://behind.eshopbox.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How Percona Toolkit Helped Us Achieve Zero-Downtime Schema Changes]]></title><description><![CDATA[At Eshopbox, database schema changes used to be one of the most dreaded activities. Even a small DDL operation would mean:

Stopping all crons before the change and restarting them after — pulling in multiple teams at odd hours.

Downtime that scaled...]]></description><link>https://behind.eshopbox.com/how-percona-toolkit-helped-us-achieve-zero-downtime-schema-changes</link><guid isPermaLink="true">https://behind.eshopbox.com/how-percona-toolkit-helped-us-achieve-zero-downtime-schema-changes</guid><category><![CDATA[Databases]]></category><category><![CDATA[10xengineer]]></category><category><![CDATA[SQL]]></category><dc:creator><![CDATA[Faisal]]></dc:creator><pubDate>Fri, 29 Aug 2025 04:36:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1756445855071/f051cc04-eb2b-42df-b31b-3d8fe56319f8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Eshopbox, database schema changes used to be one of the most dreaded activities. Even a small DDL operation would mean:</p>
<ul>
<li><p><strong>Stopping all crons</strong> before the change and restarting them after — pulling in multiple teams at odd hours.</p>
</li>
<li><p><strong>Downtime that scaled with table size</strong> — sometimes hours.</p>
</li>
<li><p><strong>Replication lag nightmares</strong> — master-slave sync issues would pop up even after the schema change was complete.</p>
</li>
</ul>
<p>Clearly, this was not scalable.</p>
<p>That’s when we discovered <strong>Percona Toolkit</strong>, specifically the <code>pt-online-schema-change</code> utility.</p>
<hr />
<h2 id="heading-why-percona-toolkit">Why Percona Toolkit?</h2>
<p>The magic of <code>pt-online-schema-change</code> is simple but powerful:</p>
<ol>
<li><p>It <strong>creates a copy of the target table</strong>.</p>
</li>
<li><p>Applies the schema change (<code>ALTER</code>, index creation, etc.) to the copy.</p>
</li>
<li><p>Sets up <strong>triggers</strong> on the original table so that ongoing writes are mirrored into the copy.</p>
</li>
<li><p>Once ready, <strong>swaps the old table with the new one</strong> via an atomic <code>RENAME TABLE</code>, transparently.</p>
</li>
</ol>
<p>The result? <strong>No downtime</strong>, even for large tables.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756441457401/0003446d-9cec-4424-9a25-9cfc3df018bd.png" alt class="image--center mx-auto" /></p>
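<p>Under the hood, the four steps map to SQL roughly like this. This is an illustrative sketch, not the exact statements the tool generates; the table, column, and trigger names are placeholders:</p>
<pre><code class="lang-plaintext">-- 1. Create a shadow copy of the target table
CREATE TABLE _orders_new LIKE orders;

-- 2. Apply the schema change to the copy
ALTER TABLE _orders_new ADD INDEX idx_example (status);

-- 3. Triggers mirror ongoing writes into the copy while existing rows
--    are backfilled in chunks (similar triggers exist for UPDATE/DELETE)
CREATE TRIGGER pt_osc_orders_ins AFTER INSERT ON orders
  FOR EACH ROW REPLACE INTO _orders_new VALUES (NEW.id, NEW.status);

-- 4. Atomic cut-over
RENAME TABLE orders TO _orders_old, _orders_new TO orders;
</code></pre>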
<hr />
<h2 id="heading-challenge-how-do-we-test-this">Challenge: How Do We Test This?</h2>
<p>The tricky part: <strong>staging traffic ≠ production traffic</strong>.</p>
<ul>
<li><p>Our staging had crons running, but nowhere near enough user traffic to replicate real-world conditions.</p>
</li>
<li><p>We needed to test how the schema change behaves with <strong>continuous inserts and updates</strong>.</p>
</li>
</ul>
<p>The solution:</p>
<ul>
<li><p>We used <strong>Postman scripts</strong> to generate artificial traffic, continuously inserting rows into the target table.</p>
</li>
<li><p>We bulk-inserted <strong>dummy data</strong> to simulate a large dataset closer to production.</p>
</li>
</ul>
<p>This gave us confidence that the tool would behave as expected under load.</p>
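<p>Anything that produces a steady write stream works for this. As a minimal alternative to Postman, bulk seed data can be generated with a small shell script; the table and column names below are placeholders, not our real schema:</p>
<pre><code class="lang-plaintext">#!/bin/sh
# Generate a SQL file of dummy INSERTs to simulate a large dataset
ROWS=1000
: > /tmp/dummy_inserts.sql
i=1
while [ "$i" -le "$ROWS" ]; do
  echo "INSERT INTO demo_table (return_status, external_updated_at) VALUES ('IN_TRANSIT', NOW());" >> /tmp/dummy_inserts.sql
  i=$((i + 1))
done
# Report how many statements were generated
wc -l /tmp/dummy_inserts.sql
</code></pre>
<p>Piping the file into the staging database (for example via <code>mysql</code> over the Cloud SQL Proxy) gives a dataset large enough to exercise chunked copying.</p>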
<hr />
<h2 id="heading-step-by-step-how-we-did-it">Step-by-Step: How We Did It</h2>
<h3 id="heading-1-create-a-vm-instance">1. Create a VM Instance</h3>
<p>We provisioned a dedicated VM for running Percona Toolkit:</p>
<ul>
<li><p>Machine: <code>e2-standard-2</code> (2 vCPUs, 8 GB Memory)</p>
</li>
<li><p>OS: Ubuntu 22.04.5 LTS</p>
</li>
<li><p>Storage: 20 GB</p>
</li>
<li><p>Labels for easy identification</p>
</li>
</ul>
<p>This ensures a <strong>controlled, isolated environment</strong> for running schema changes without affecting app servers.</p>
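<p>For reference, provisioning an equivalent instance with the gcloud CLI looks roughly like this. Treat it as a sketch: the instance name and label are assumptions, and image families change over time:</p>
<pre><code class="lang-plaintext">gcloud compute instances create ptosc-runner \
  --machine-type=e2-standard-2 \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=20GB \
  --labels=purpose=schema-migration
</code></pre>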
<hr />
<h3 id="heading-2-install-and-configure-cloud-sql-proxy">2. Install and Configure Cloud SQL Proxy</h3>
<p>Since our MySQL database runs on <strong>GCP Cloud SQL</strong>, direct connections aren’t always straightforward.<br />We installed <strong>Cloud SQL Proxy</strong> to establish a secure tunnel and access the DB via <code>localhost</code>.</p>
<p>This step is critical for:</p>
<ul>
<li><p>Security (no exposing DB IPs publicly).</p>
</li>
<li><p>Convenience (simpler connection strings).</p>
</li>
</ul>
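<p>For reference, running the v2 proxy looks like this. This is a sketch: the version and connection name are placeholders, and the older v1 binary (<code>cloud_sql_proxy</code>) uses a different <code>-instances</code> syntax:</p>
<pre><code class="lang-plaintext"># Download the Cloud SQL Auth Proxy and tunnel the instance to localhost:3306
curl -o cloud-sql-proxy \
  https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v&lt;VERSION&gt;/cloud-sql-proxy.linux.amd64
chmod +x cloud-sql-proxy
./cloud-sql-proxy --port 3306 &lt;PROJECT&gt;:&lt;REGION&gt;:&lt;INSTANCE&gt;
</code></pre>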
<hr />
<h3 id="heading-3-set-database-flag">3. Set Database Flag</h3>
<p>We enabled:</p>
<pre><code class="lang-plaintext">log_bin_trust_function_creators = ON
</code></pre>
<p><strong>Why?</strong><br />Percona Toolkit creates triggers and temporary functions, which Cloud SQL blocks by default unless this flag is enabled; without it, the tool fails. The flag can also be set from the Google Cloud console.</p>
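<p>On Cloud SQL the flag can also be applied from the CLI; a sketch, with the instance name as a placeholder:</p>
<pre><code class="lang-plaintext"># Note: --database-flags replaces ALL existing flags on the instance,
# so include any flags already set, and expect a possible restart
gcloud sql instances patch &lt;INSTANCE_NAME&gt; \
  --database-flags=log_bin_trust_function_creators=on
</code></pre>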
<hr />
<h3 id="heading-4-prevent-session-timeout-with-tmux">4. Prevent Session Timeout with <code>tmux</code></h3>
<p>Schema migrations can run for <strong>hours</strong> depending on table size. If your SSH session dies, you don’t want the migration to die with it.</p>
<p>That’s why we ran Percona Toolkit inside <code>tmux</code>:</p>
<pre><code class="lang-plaintext">tmux new -s ptosc
</code></pre>
<ul>
<li><p>Detach (<code>Ctrl + B, D</code>) to keep it running.</p>
</li>
<li><p>Reattach anytime with <code>tmux attach -t ptosc</code>.</p>
</li>
</ul>
<p>This small step saved us a lot of anxiety.</p>
<hr />
<h3 id="heading-5-run-schema-change-with-pt-online-schema-change">5. Run Schema Change with <code>pt-online-schema-change</code></h3>
<p>Here’s the exact command we ran (staging first, then production):</p>
<pre><code class="lang-plaintext">pt-online-schema-change \
  --alter "ADD INDEX idx_return_status_updated_at (return_status, external_updated_at)" \
  --charset=latin1 \
  --no-version-check \
  --recursion-method=none \
  --chunk-time=2 \
  --chunk-size-limit=2000 \
  --max-load "Threads_running=30" \
  --critical-load "Threads_running=40" \
  --max-lag=10 \
  --check-slave-lag h=&lt;replica-host&gt;,P=3306,u=&lt;user&gt;,p=&lt;password&gt; \
  --check-interval=5 \
  --progress time,10 \
  --alter-foreign-keys-method=auto \
  --preserve-triggers \
  --pause-file=/tmp/ptosc.pause \
  --execute \
  --set-vars "lock_wait_timeout=600,innodb_lock_wait_timeout=60,wait_timeout=28800,interactive_timeout=28800,net_read_timeout=600,net_write_timeout=600,max_allowed_packet=1073741824" \
  h=&lt;master-host&gt;,P=3306,u=&lt;user&gt;,p=&lt;password&gt;,D=eshopbox_wms_production,t=return_shipment_status_logs
</code></pre>
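<p>Before committing with <code>--execute</code>, the tool also supports a cheap rehearsal: swapping in <code>--dry-run</code> creates and alters the shadow table but copies no rows and adds no triggers, so mistakes in the <code>--alter</code> clause surface early:</p>
<pre><code class="lang-plaintext">pt-online-schema-change \
  --alter "ADD INDEX idx_return_status_updated_at (return_status, external_updated_at)" \
  --dry-run \
  h=&lt;master-host&gt;,P=3306,u=&lt;user&gt;,p=&lt;password&gt;,D=eshopbox_wms_production,t=return_shipment_status_logs
</code></pre>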
<hr />
<h3 id="heading-explaining-key-flags">Explaining Key Flags</h3>
<p>Here’s what each flag does (and why we used it):</p>
<ul>
<li><p><code>--alter</code> → The schema change. In our case, adding a composite index.</p>
</li>
<li><p><code>--charset=latin1</code> → Matches our table’s charset to avoid encoding mismatches.</p>
</li>
<li><p><code>--no-version-check</code> → Skips version check to avoid interruptions (useful in CI/CD).</p>
</li>
<li><p><code>--recursion-method=none</code> → Disables automatic slave discovery; we explicitly provided replica host.</p>
</li>
<li><p><code>--chunk-time=2</code> → Each data chunk should take ~2s. Balances migration speed vs. load.</p>
</li>
<li><p><code>--chunk-size-limit=2000</code> → Prevents overly large chunks that can lock tables.</p>
</li>
<li><p><code>--max-load "Threads_running=30"</code> → If DB has &gt;30 active threads, pause migration. Protects production.</p>
</li>
<li><p><code>--critical-load "Threads_running=40"</code> → If threads exceed 40, migration aborts immediately. Safety net.</p>
</li>
<li><p><code>--max-lag=10</code> → If replica lag exceeds 10s, migration pauses. Ensures replication health.</p>
</li>
<li><p><code>--check-slave-lag ...</code> → Host details of replica to monitor lag.</p>
</li>
<li><p><code>--check-interval=5</code> → Checks system health every 5 seconds.</p>
</li>
<li><p><code>--progress time,10</code> → Prints progress every 10s for monitoring.</p>
</li>
<li><p><code>--alter-foreign-keys-method=auto</code> → Handles foreign keys automatically.</p>
</li>
<li><p><code>--preserve-triggers</code> → Ensures existing triggers remain intact.</p>
</li>
<li><p><code>--pause-file=/tmp/ptosc.pause</code> → Migration pauses if this file exists. Handy kill-switch.</p>
</li>
<li><p><code>--set-vars ...</code> → Overrides MySQL system variables for smoother long-running ops.</p>
</li>
</ul>
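<p>The <code>--pause-file</code> kill-switch needs no extra tooling, since it is just a file on disk:</p>
<pre><code class="lang-plaintext">#!/bin/sh
# Pause: pt-osc stops copying chunks while this file exists
touch /tmp/ptosc.pause
echo "migration paused"

# Resume: remove the file and pt-osc picks up where it left off
rm /tmp/ptosc.pause
echo "migration resumed"
</code></pre>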
<p>⚖️ <strong>Tip for tuning flags</strong>:</p>
<ul>
<li><p>Start with conservative values (<code>chunk-time</code>, <code>max-load</code>, <code>max-lag</code>).</p>
</li>
<li><p>Run in staging with traffic simulation.</p>
</li>
<li><p>Adjust iteratively based on monitoring (CPU, replication lag, query latency).</p>
</li>
</ul>
<hr />
<h2 id="heading-execution-on-production">Execution on Production</h2>
<p>After verifying everything on staging with traffic simulation, we confidently ran the exact command on production.</p>
<p>✅ Results:</p>
<ul>
<li><p>No downtime.</p>
</li>
<li><p>Replication lag stayed within safe limits.</p>
</li>
<li><p>Teams didn’t have to pause crons or wake up at odd hours.</p>
</li>
</ul>
<hr />
<h2 id="heading-lessons-learned">Lessons Learned</h2>
<ul>
<li><p><strong>Always test on staging</strong> with realistic traffic.</p>
</li>
<li><p><strong>Monitoring is key</strong> — keep an eye on replication lag, threads running, and system load.</p>
</li>
<li><p><strong>Have a rollback plan</strong> — <code>--pause-file</code> and conservative flags saved us from surprises.</p>
</li>
<li><p>Percona Toolkit is powerful, but <strong>safe only if configured properly</strong>.</p>
</li>
</ul>
<hr />
<h2 id="heading-whats-next">What’s Next</h2>
<p>We’re now integrating <code>pt-online-schema-change</code> into our <strong>database migration pipeline</strong> so that schema changes are:</p>
<ul>
<li><p>Version-controlled</p>
</li>
<li><p>Tested automatically in staging</p>
</li>
<li><p>Rolled out with minimal human intervention</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>