From Thousands of Files to Just Seven: A Data Lakehouse Optimization Story
November 26, 2025

At Endowus, data is at the heart of everything we do, from personalizing client experiences to powering our investment insights. As our platform grows, so does our data lakehouse, and with that growth comes the inevitable engineering challenges of managing scale, cost, and performance. Recently, our Data Platform team embarked on a mission to tackle two critical issues that were impacting our data pipelines: the notorious “small file problem” and the crucial challenge of data freshness.
This post shares our journey of diagnosing these challenges, implementing a multi-faceted solution using Delta Lake, and the dramatic improvements we achieved in performance, cost, and reliability. It’s a story about how rethinking the physical layout of our data unlocked efficiencies across the entire platform.
Our Data Platform: A Quick Overview
Before diving into the challenges, it’s helpful to understand the key technologies powering our data platform - the foundations that enable our team to deliver reliable, performant data products at scale across Singapore and Hong Kong.
- Data Lakehouse with Medallion Architecture: At the heart of our platform is a data lakehouse built on Delta Lake, backed by Amazon S3, organized using the Medallion Architecture - a three-tier design (Bronze → Silver → Gold) that progressively refines data from raw extracts to analytics-ready tables.
- Data Processing with Apache Spark: Apache Spark with Scala powers our data transformations, deployed on AWS Glue and EMR Serverless, with more than 100 jobs running daily across SG and HK.
- Orchestration & Workflow Management: Apache Airflow on AWS EKS orchestrates 150+ production DAGs across our pipeline ecosystem.
With this technical foundation in place, let’s explore the challenges that emerged as our platform scaled.
The Challenge: Small Files, Big Problems
Our data platform was struggling under the weight of its own success. The core of the problem lay in one of our largest and most frequently accessed tables: the transaction table. It was fragmented into over 1,800 small files, a classic symptom of the “small file problem” in big data systems, where data is stored in many small, inefficient chunks instead of fewer, larger ones.
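Delta Lake makes this kind of fragmentation easy to measure from its own metadata. The short Scala sketch below is illustrative rather than our exact tooling: it assumes a spark-shell style session where `spark` is in scope and uses `transactions` as a placeholder table name, then reads the `numFiles` and `sizeInBytes` fields that `DESCRIBE DETAIL` exposes.

```scala
// Diagnostic sketch: how fragmented is a Delta table?
// Assumes a spark-shell session where `spark: SparkSession` is in scope;
// "transactions" is an illustrative table name.
val detail = spark.sql("DESCRIBE DETAIL transactions")
  .select("numFiles", "sizeInBytes")
  .first()

val numFiles = detail.getAs[Long]("numFiles")
val avgFileMb =
  if (numFiles > 0) detail.getAs[Long]("sizeInBytes") / numFiles / (1024.0 * 1024.0)
  else 0.0

// Thousands of files with a tiny average size is the classic small-file signature
println(f"files=$numFiles, avg file size=$avgFileMb%.1f MB")
```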
This fragmentation triggered a cascade of issues that radiated outward, affecting everything from query speed to our monthly cloud bill:
- Degraded Performance: Queries against the table were painfully slow. Before a query engine can even read the data, it must perform costly “list” operations to discover all the files and then open each one to read its metadata. With thousands of files, this I/O overhead alone created a significant bottleneck, delaying critical analytics and reporting.
- Increased Costs: We were at risk of hitting the request limit for AWS Key Management Service (KMS), which we use for envelope encryption on our S3 data. Because each encrypted file requires a separate API call to KMS to decrypt its unique data key, a single query could generate a peak of over 87,000 KMS requests. This not only incurred direct KMS costs, but also consumed a large portion of our account’s KMS request quota.
- Pipeline Instability: The sheer volume of KMS requests created a constant risk of API throttling, which threatened the stability of our entire data pipeline. A single complex query could trigger a cascade of failures in downstream jobs, leading to operational firefighting and delays in data availability for the rest of the business.
The root cause was a combination of factors common in evolving data architectures:
- Excessive Partitioning: Our initial approach of using daily Hive-style partitioning (/year=YYYY/month=MM/day=DD) seemed logical, but it created a new folder and at least one new file for every day’s data, regardless of whether we had ten transactions or ten thousand.
- Inefficient Incremental Writes: Frequent micro-batching processes, a common pattern in streaming and near-real-time ingestion, were continuously adding to the proliferation of small files without a corresponding compaction strategy.
- Suboptimal Data Layout: The physical layout of the data, dictated by the rigid daily partitioning, didn’t allow for effective “data skipping.” For example, a query for a single user’s transactions would still require the engine to list and inspect files across hundreds of daily partitions, scanning far more data than necessary.
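To make that concrete, here is a rough sketch of the query shape we mean, assuming a spark-shell session with `spark` in scope; the S3 path and the user_id column are placeholders. The filter carries no year/month/day predicate, so the daily folder structure gives the engine nothing to prune on.

```scala
// Illustrative per-user lookup; path and user_id column are placeholders.
import org.apache.spark.sql.functions.col

val userTxns = spark.read.format("delta")
  .load("s3://example-bucket/silver/transactions")
  .filter(col("user_id") === "user-12345") // no date predicate: every daily
                                           // partition must still be listed
userTxns.show()
```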
Compounding this, we faced a data freshness and reliability problem. Our original implementation dated back to the earliest days of Endowus, when we took the simplest and most pragmatic approach to loading data into the warehouse: refresh the last six months of data every six hours. This blunt approach was reasonable while volumes were low. But as the business scaled, it not only became computationally expensive, it also missed any updates to records older than six months. The result was data staleness, which required manual backfills and eroded trust in the data.
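For context, that old pattern looked roughly like the sketch below. This is a simplified illustration rather than our exact job: the paths are placeholders, the choice of created_date as the window column is an assumption, and it presumes a spark-shell session with `spark` in scope. The key point is that every run re-reads and rewrites the entire six-month window, whether or not anything changed.

```scala
// Simplified sketch of the old windowed refresh; paths and columns are illustrative.
import java.time.LocalDate

val cutoff = LocalDate.now().minusMonths(6).toString // e.g. "2025-05-26"

// Re-read the last six months from the upstream layer...
val recentSource = spark.read.format("delta")
  .load("s3://example-bucket/bronze/transactions")
  .filter(s"created_date >= '$cutoff'")

// ...and overwrite the matching slice of the target, every six hours
recentSource.write.format("delta")
  .mode("overwrite")
  .option("replaceWhere", s"created_date >= '$cutoff'")
  .save("s3://example-bucket/silver/transactions")
```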
Our Solution: A Two-Pronged Approach with Delta Lake
To solve these interconnected problems, we turned to the advanced capabilities of Delta Lake 3.1+. Our solution focused on two key areas: fundamentally optimizing the physical data layout and completely redesigning our data ingestion process for efficiency and accuracy.
Part 1: Solving the Small File Proliferation
Our first goal was to drastically reduce the number of files. We moved away from rigid, time-based partitioning and adopted a more dynamic, query-aware strategy using Delta Lake’s built-in features.
- Departitioning & Liquid Clustering: The first step was bold: we removed the Hive-style daily partitioning entirely. In its place, we implemented Delta Lake’s Liquid Clustering. Unlike traditional partitioning, which is limited to a few low-cardinality columns, liquid clustering allows for flexible, multi-dimensional clustering of data. It reorganizes the data layout based on the columns most frequently used in query WHERE clauses. For our transaction table, we clustered on transaction_type and created_date. This ensures that related data is physically co-located within the same files, allowing Delta Lake’s engine to perform highly effective data skipping and dramatically speeding up queries. (A short sketch of these commands follows this list.)
- Optimized Writes: Traditional writes in distributed systems can be chaotic, with multiple parallel workers each creating their own small files. Delta Lake’s Optimized Writes feature solves this by adding an extra shuffle step before the write operation. This shuffle intelligently repartitions and sorts the data in memory across the cluster, ensuring that a minimal number of larger, well-packed files are written to storage from the outset.
- Scheduled Compaction: To maintain this optimized state over time as new data is ingested, we implemented a scheduled job in Airflow that runs the OPTIMIZE command. This command intelligently identifies partitions with small files and compacts them into larger, 1GB files. This periodic housekeeping ensures our table remains performant without manual intervention.
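Put together, the new layout comes down to a handful of Delta commands. The sketch below is illustrative rather than our exact migration: it assumes Delta Lake 3.1+ with a spark-shell style session where `spark` is in scope, the table name and schema are placeholders, and the exact clustering DDL can vary between Delta releases.

```scala
// Illustrative layout setup, assuming Delta Lake 3.1+ and `spark` in scope.
// Table name and schema are placeholders; clustering DDL varies slightly by version.

// 1. Liquid Clustering: the table is (re)created with clustering columns
//    instead of Hive-style year/month/day partitions
spark.sql("""
  CREATE TABLE transactions (
    transaction_id   STRING,
    transaction_type STRING,
    created_date     DATE,
    amount           DECIMAL(18, 2)
  )
  USING DELTA
  CLUSTER BY (transaction_type, created_date)
""")

// 2. Optimized Writes: shuffle before writing so each batch lands as fewer, larger files
spark.sql("""
  ALTER TABLE transactions
  SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true')
""")

// 3. Compaction (scheduled from Airflow): rewrite small files into large ones
//    and keep data clustered on the columns above
spark.sql("OPTIMIZE transactions")
```

One nice property of liquid clustering is that the clustering columns can later be changed with an ALTER TABLE, without reworking the pipeline, which is something rigid Hive-style partitioning never allowed.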
Part 2: Ensuring Data Freshness with MERGE
To solve the data reliability issue, we replaced our brute-force six-month refresh with Delta Lake’s powerful MERGE command, effectively creating an intelligent “upsert” logic.
The MERGE command allows us to atomically update, delete, or insert data into a target table based on a source table and a specified key. We use transaction_id as our merge key. When a new batch of data arrives, the MERGE operation efficiently performs the equivalent of two joins in a single, ACID-compliant transaction:
- An inner join (WHEN MATCHED) to find records that already exist in the target table and require an UPDATE.
- An outer join (WHEN NOT MATCHED) to find new records that are not in the target table and require an INSERT.
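In Scala, the upsert looks roughly like the sketch below. It assumes a spark-shell style session with `spark` in scope; the `transactions` table name and the shape of the incoming `updates` DataFrame are illustrative, while transaction_id as the merge key is the part that mirrors our actual design.

```scala
// Upsert sketch using the Delta Lake Scala API; `updates` stands in for the
// incoming batch and the table name is illustrative.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

def upsertTransactions(updates: DataFrame): Unit = {
  val target = DeltaTable.forName(spark, "transactions")

  target.as("t")
    .merge(updates.as("s"), "t.transaction_id = s.transaction_id")
    .whenMatched()       // record already exists: apply the latest values
    .updateAll()
    .whenNotMatched()    // brand-new record: insert it
    .insertAll()
    .execute()
}
```

Because the whole MERGE commits as a single ACID transaction, a failed batch leaves the table untouched and can simply be retried.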
This surgical approach ensures that any update, even to a record from years ago, is accurately captured and reflected in the table. We no longer waste compute cycles re-processing unchanged data, and we’ve eliminated the risk of missing historical updates.
The Results: Dramatic Improvements Across the Board
The impact of these changes was immediate and profound, validating our approach and delivering value across performance, cost, and reliability.
On the Small File & KMS Front
- 99.6% File Count Reduction: The number of files in the transaction table dropped from ~1,800 to just 7. This is the difference between asking the query engine to check a filing cabinet versus a whole library.
- 99.6% KMS Request Reduction: Consequently, KMS requests per query plummeted from over 87,000 to an average of just 351.
- $434 Monthly Cost Savings: We saved $434 per month in AWS spend from optimizing just a single table.
On the Data Freshness & Reliability Front
By switching from a full refresh to the surgical MERGE operation, we saw a 79% reduction in overall resource consumption for our data ingestion pipeline in EMR Serverless.
- EMR Runtime: The pipeline’s runtime was halved, from over 6 minutes to just 3 minutes, delivering fresher data to the business faster.
- vCPU & Memory Usage: Core resource utilization was reduced by 79%, significantly lowering the cost of each pipeline run and reducing our overall cloud footprint.
Key Learnings and What’s Next
Our biggest takeaway from this initiative was a powerful reminder: optimizing the underlying physical data layout can solve multiple, seemingly unrelated problems simultaneously. By tackling the small file problem, we not only improved query performance but also slashed operational costs and fortified our pipeline’s stability. It’s a foundational concern, not just an implementation detail.
This project is a blueprint for future optimizations across our entire data lakehouse. Our next steps include:
- Expanding Coverage: Systematically applying these clustering and compaction techniques to other high-query, high-volume tables like orders, redemptions, and investments.
- Integrating into ETL: Building optimization directly into our data pipelines, triggering compaction automatically after ingestion jobs, rather than relying on separate maintenance schedules.
- Continuous Monitoring: Actively tracking key metrics like file counts, average file size, and query latency patterns to proactively adapt our clustering strategies as our data and access patterns evolve.
- Evaluating Emerging Tech: Keeping a close eye on new features in Delta Lake, Apache Iceberg, and other technologies to continuously enhance our data platform’s capabilities.
We’re excited by these results and are committed to building a data platform that is not only powerful but also efficient, reliable, and scalable.