100x Performance: How We Optimized a Legacy Codebase Without a Full Rewrite
A practical guide to profiling, identifying bottlenecks, and making targeted optimizations that delivered 100x performance gains in a scientific computing application — without starting from scratch.

The Situation
A client came to us with a scientific computing application that was too slow. Their engineers were waiting 45 minutes for calculations that should take seconds. The application had been built over several years in Python, and the team’s instinct was that they needed a complete rewrite in a faster language.
Complete rewrites are expensive, risky, and often unnecessary. Before writing a single line of new code, we profiled the application to understand where the time was actually going.
Step 1: Profiling
We instrumented the application with Python’s cProfile module and ran it against a representative set of real-world inputs. The profiling data told a different story than the team expected.
The application had roughly 200 functions. Of those:
- 3 functions consumed 87% of total runtime
- 12 functions consumed another 10%
- 185 functions consumed the remaining 3%
The three hot functions were:
- A thermodynamic property lookup that was being called millions of times per calculation, each time re-initializing the fluid model from scratch
- An iterative solver that used a fixed step size, requiring thousands of iterations to converge even when the solution was close
- A results formatting function that converted every number to a string, formatted it, and immediately parsed it back to a number in the next step
The team had assumed the bottleneck was in the matrix operations and had been considering rewriting those in C. The matrix operations accounted for 2% of runtime.
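The profiling step needs nothing beyond the standard library. A minimal sketch using cProfile and pstats — the `workload` function here is a stand-in for one representative calculation run:

```python
import cProfile
import io
import pstats

def workload():
    # Stand-in for one representative calculation run.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time to surface the handful of hot functions.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

For an application you would rather not instrument, `python -m cProfile -o profile.out app.py` writes the same data to a file, which `pstats.Stats("profile.out")` can load and sort afterward.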
Step 2: Algorithmic Improvements
The biggest performance gains in any optimization project come from algorithmic changes, not implementation changes. Making a slow algorithm run faster in a faster language is still running a slow algorithm.
Caching the Fluid Model
The thermodynamic property lookup was calling the REFPROP library to initialize a fluid model on every single invocation. Initializing the model involves parsing fluid data files, computing interaction parameters, and allocating internal state. This took about 50ms per call — and the function was being called 200,000+ times per calculation.
The fix: cache the initialized fluid model and reuse it across calls. If the composition has not changed (which it had not — it was the same fluid for the entire calculation), skip the initialization entirely.
This single change cut total runtime by approximately 60%.
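The caching pattern itself is small. A sketch of the idea — `_initialize_fluid_model` is a hypothetical stand-in for the expensive REFPROP setup, not the client's actual code:

```python
import functools

def _initialize_fluid_model(composition: tuple) -> dict:
    # Hypothetical stand-in for the real initialization, which parses
    # fluid data files and computes interaction parameters (~50 ms).
    return {"composition": composition}

@functools.lru_cache(maxsize=8)
def get_fluid_model(composition: tuple) -> dict:
    # The cache key must be hashable (hence a tuple); repeated lookups
    # for the same fluid now skip initialization entirely.
    return _initialize_fluid_model(composition)
```

Because the composition never changed within a calculation, every call after the first becomes a dictionary lookup instead of a 50ms initialization.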
Adaptive Solver Step Size
The iterative solver used a fixed step size to search for the solution. For cases where the solution was far from the initial guess, this was fine — the step size was appropriate. But for cases where the solution was close, the solver was taking thousands of tiny, unnecessary steps.
We replaced the fixed step size with an adaptive algorithm that starts with large steps and reduces the step size as it approaches convergence. Specifically, we implemented a bisection method with an initial bracketing phase, followed by Brent’s method for the final convergence.
This reduced average iteration counts from ~5,000 to ~50 — a 100x reduction in solver iterations. Combined with the caching fix, total runtime dropped by about 90%.
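The bracket-then-refine shape of the solver can be sketched in a few lines. To keep this self-contained, the refinement phase below uses plain bisection; the production fix used Brent's method (e.g. `scipy.optimize.brentq`) there, which converges faster:

```python
def solve_adaptive(f, x0, step=1.0, grow=2.0, tol=1e-9, max_expand=60):
    # Bracketing phase: take geometrically growing steps away from the
    # initial guess until f changes sign across [a, b].
    a, b = x0, x0 + step
    fa, fb = f(a), f(b)
    for _ in range(max_expand):
        if fa * fb <= 0:
            break
        step *= grow
        a, fa = b, fb
        b = a + step
        fb = f(b)
    else:
        raise ValueError("no sign change found while bracketing")
    # Refinement phase: bisection here; Brent's method is the faster
    # drop-in replacement the production solver used.
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = f(mid)
        if fa * fm <= 0:
            b = mid
        else:
            a, fa = mid, fm
    return 0.5 * (a + b)
```

The key property is that the step count scales with the logarithm of the distance to the solution rather than linearly, which is what collapsed ~5,000 fixed-size steps to ~50.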
Eliminating Redundant Conversions
The results formatting function was converting floating-point numbers to formatted strings for display, then immediately parsing them back to floats for the next calculation step. This pattern appeared because the code had been written incrementally — the formatting was originally the final step, and later calculations were added after it without refactoring the pipeline.
We restructured the pipeline to keep numbers in their native representation throughout the calculation chain and only format them at the very end, for output. This eliminated millions of unnecessary string-to-float round trips and removed a subtle source of precision loss.
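The anti-pattern and the fix fit in a few lines. The `.6g` format below is an illustrative choice, not the client's actual output format:

```python
# Before: each intermediate stage formatted the value for display and
# the next stage parsed it back, losing precision at every hop.
def stage_with_roundtrip(x: float) -> float:
    return float(f"{x:.6g}")

# After: values stay as floats through the whole chain and are
# formatted exactly once, at the output boundary.
def format_for_output(x: float) -> str:
    return f"{x:.6g}"
```

A value like 1/3 survives any number of native-float stages unchanged, but a single format-and-parse round trip truncates it to six significant digits — the precision loss compounds with every stage.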
Step 3: Language Migration (Targeted)
After the algorithmic improvements, we profiled again. The application was now about 10x faster than the original. The remaining bottleneck was the innermost computation loop — a tight loop performing thermodynamic calculations that ran millions of iterations.
Python is not well-suited for tight numerical loops. The interpreter overhead per iteration is significant, and even with NumPy, the loop structure prevented vectorization.
We extracted the inner loop into a Go module. Go was chosen because:
- The client’s team was already using Go for other projects
- Go’s compilation to native code eliminates interpreter overhead
- Go’s concurrency model made it straightforward to parallelize independent calculations
- The FFI (Foreign Function Interface) between Go and the REFPROP C library is clean
The Go module handles only the hot path — the tight computation loop that runs millions of iterations. Everything else (input parsing, job orchestration, results aggregation, reporting) stays in Python. This targeted migration delivered an additional 20x speedup on the most expensive calculations.
Step 4: Parallelization
With the inner loop running in Go, we added concurrency. Many of the calculations in a batch are independent of each other — they can run in parallel without coordination.
We implemented a worker pool pattern in Go: a fixed number of goroutines pull calculation tasks from a channel and process them concurrently. The pool size is configurable and defaults to the number of available CPU cores.
For batch workloads, this delivered a near-linear speedup with core count. On an 8-core machine, batch processing was approximately 7x faster than single-threaded execution (the sub-linear scaling is due to memory bandwidth and cache contention on the REFPROP library’s internal state).
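The production pool is goroutines pulling tasks from a channel. For readers on the Python side of the codebase, the same worker-pool shape looks like this with `concurrent.futures` — a thread pool here so the sketch is self-contained; CPU-bound pure-Python work would need `ProcessPoolExecutor` to sidestep the GIL, whereas calls into the native Go module do not hold it:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def calculate(task: int) -> int:
    # Stand-in for one independent calculation in the batch.
    return task * task

def run_batch(tasks, workers=None):
    # Worker pool: a fixed number of workers pull tasks and run them
    # concurrently; the pool size defaults to the CPU core count.
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        # map() preserves input order in its results.
        return list(pool.map(calculate, tasks))
```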
The Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Single calculation | 45 min | 27 sec | ~100x |
| Batch of 1,000 | Not feasible | 18 min | N/A |
| Memory usage | 2.1 GB | 340 MB | 6x reduction |
| Code rewritten | — | ~15% | — |
The 100x improvement came from the combination of:
- Caching (60% runtime reduction)
- Adaptive solver (cut ~75% of the remaining runtime, bringing the total reduction to ~90%)
- Redundant conversion elimination (minor but measurable)
- Go migration of hot path (20x on computation-intensive cases)
- Parallelization (7x on batch workloads)
Only about 15% of the codebase was actually rewritten. The rest was left untouched.
Takeaways
Profile first. Always. Your intuition about where the bottleneck is will be wrong. The data will tell you exactly where to focus.
Fix the algorithm before fixing the implementation. Caching and adaptive step sizing delivered 10x before we migrated a single line to a faster language. No amount of low-level optimization would have matched that.
Migrate only the hot path. A full rewrite would have taken 6+ months and introduced risk. Migrating 15% of the code to Go took 3 weeks and delivered the same (or better) performance benefit.
Measure after every change. We profiled after every optimization step to confirm the improvement and identify the next bottleneck. Optimization without measurement is guessing.
The client expected a months-long rewrite project. We delivered 100x performance in 6 weeks by being disciplined about profiling, making targeted changes, and resisting the urge to rewrite things that were not actually slow.
"The client expected a 5x improvement and thought we would need to rewrite the whole thing. We delivered 100x by profiling first and only rewriting the parts that mattered."