ABCloudz recently completed a significant project migrating an Enterprise Data Warehouse (EDW) from an on-premises SQL Server to Azure Databricks for a leading global retailer operating 3,000 offline stores across 40 countries. The migration was essential to address the physical server limitations and outdated technology that hindered data processing and storage. By embracing Big Data capabilities, our Azure Databricks solution improved the client’s data scalability and processing efficiency, directly strengthening their ability to manage and analyze large data volumes.

Read on to discover how the migration was executed and the improvements it delivered.

Background and challenges of the migration project

The need to migrate an Enterprise Data Warehouse (EDW) to Azure Databricks arose from critical challenges faced by a global retailer with extensive operations.  

Difficult integration of new tools: Various teams within the client’s large organizational structure wanted to implement new tools tailored to their specific needs, but the old infrastructure couldn’t easily support these demands, leading to delays and inefficiencies. 

Retrieving previously unextracted data for new reports: Creating new reports that required previously unextracted raw data was difficult. It took significant time and effort to pull this data from the source systems, with no guarantee that the needed historical data was still available, making it challenging to analyze past trends. 

Infrastructure limitations: Originally built on SQL Server 15 years ago, the on-premises EDW began to falter under its aging infrastructure, with shrinking storage capacity and slower data processing, prompting an urgent need for modernization.

Documentation challenges: A significant hurdle was the absence of unified, comprehensive documentation. The existing system, over 15 years old, lacked any standard documentation practices, with crucial data and system changes scattered across various segments without cohesive integration. 

Understanding the initial system architecture

Before embarking on our migration project, our client operated a complex and extensive data ecosystem that was pivotal to their global retail operations. This system consisted of various interconnected components, each serving specific functions crucial for daily operations and strategic decision-making.

[Architecture diagram: the client’s original data ecosystem]

This architectural diagram shows only the most essential source components; the actual system includes many more. Let’s take a look at each component.

ATG: The ATG component, implemented on the Oracle ATG Web Commerce platform, serves as the core for our client’s e-commerce websites across various global markets. This component effectively manages data inflow from mobile applications, online marketplaces (each adapted to the regional assortments and promotions of different countries), and the Customer Support Center (CSC).  

Order Management System (OMS): The Order Management System centralizes all online commerce transactions for our client’s global retail operations. It efficiently processes order data from various online marketplaces.  

Point of Sales (POS): The Point of Sales system, integral to our client’s offline retail operations, runs on a DB2 database hosted on a mainframe. This system handles all in-store transactions at cash registers, managed by live cashiers. It supports complex transactional processes, including promotions, loyalty points accumulation, and order operations such as merges and deliveries.

Sales Audit: The Sales Audit system, operating on the Aptos platform, captures and verifies transaction data from all sales channels—both online (via OMS) and offline (via Point of Sales). It ensures every transaction, including sales, promotions, and taxes, is completed and payment is received before logging. Only confirmed transactions are reported, filtering out cancellations and uncompleted orders. Data is then securely transferred to the Enterprise Data Warehouse via SFTP, maintaining consistency and accuracy for financial reporting and analysis. 

Traffic: The Traffic component relies on people-counting systems from SenSource, which use visual sensors to track the number of customers entering and exiting offline stores. These metrics are integrated into broader retail analytics to help forecast future sales and store performance. Data updates are sent to the Enterprise Data Warehouse every 15 minutes via a REST API, ensuring near-real-time accuracy for timely decision-making. This component plays a pivotal role in combining traffic data with sales and budget information to enhance strategic planning and operational insights.

Inventory: The Inventory system utilizes Yantriks technology with AI and machine learning to optimize stock management and supply chain operations. It predicts demand and automates stock distribution recommendations across a network of stores and warehouses. Data on stock levels is transmitted in JSON format via Kafka, processed, and integrated into the EDW for reporting and analytics. 

Staff Metrics: Staff Metrics leverages Dayforce HCM (Human Capital Management) to optimize workforce management and employee engagement. This cloud-based tool captures real-time data on staff activities, including time punches that detail arrivals, departures, and movements within store zones. These metrics are crucial for linking employee attendance with sales and traffic data, enabling analysis of staff efficiency and sales impact. Updated hourly in the EDW, this data supports dynamic performance monitoring and strategic workforce adjustments. 

Budget files: Budget files are crucial for financial planning, containing various forecasts and projections related to sales, staffing, and transactions. The Finance Business Planning & Analysis department regularly updates and stores these files on SharePoint, archiving older versions on ShareDrive. The system retrieves these files as required. 

ETL (Extract, Transform, Load): The ETL component, implemented using IBM DataStage, facilitates the extraction, transformation, and loading of data from various source systems into the centralized data warehouse. This process integrates disparate data into a unified format, enabling analytics and reporting. 

Enterprise Data Warehouse (EDW): The EDW operates on a Microsoft SQL Server configured for OLAP (Online Analytical Processing) tasks such as complex queries and data aggregation. Established over 15 years ago, it spans a hybrid environment with on-premises servers housed in a rented data center and portions hosted in the cloud. 

BI reports: Tableau Bridge and IBM Cognos/Smart are configured to deliver reports tailored to the specific needs of various management levels. Store managers receive detailed hourly sales statistics and daily summaries to oversee operations effectively. Top executives receive high-level, real-time analytics that cover broader operational metrics, facilitating strategic decision-making across the company. 

OLAP cubes: OLAP cubes, built on Microsoft SQL Server Analysis Services (SSAS), efficiently process large datasets for multidimensional analysis across variables like time, product categories, and regional markets. These cubes enable customized report generation tailored to specific departmental needs. 

The enhanced architecture on Azure Databricks

Following the migration of the Enterprise Data Warehouse and critical infrastructures to Azure Databricks, our client enjoys an advanced data ecosystem optimized for scalability and efficiency. This new system retains the core data sources while dramatically improving data processing capabilities. 

We structured the new data platform on Azure Databricks using a classic three-tiered approach: the Bronze, Silver, and Gold layers. Each layer plays a critical role in managing the flow and refinement of data from raw inputs to actionable insights.

[Architecture diagram: the new data platform on Azure Databricks]

Bronze layer: This layer acts as the initial repository for raw data coming from various sources like databases, files, REST APIs, Kafka topics, and SharePoint. It serves as the staging area where data remains in its original form, suitable for historical data processing (HDP). This method ensures all historical data is readily accessible for additional analysis without needing to re-fetch from original sources. 
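
To make the idea concrete, here is a minimal PySpark sketch of how raw inventory events from a Kafka topic could be landed unchanged in a Bronze Delta table. The broker address, topic, table, and checkpoint path are illustrative assumptions rather than the client’s actual configuration.

```python
# Minimal sketch: land raw Kafka messages in a Bronze Delta table without transforming them.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

# Read the assumed inventory topic as a stream; the payload stays an unparsed JSON string.
raw_inventory = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")  # assumed broker address
    .option("subscribe", "inventory-stock-levels")            # assumed topic name
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS raw_json",
                "timestamp")
    .withColumn("ingested_at", current_timestamp())
)

# Append to the Bronze table exactly as received.
(
    raw_inventory.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/inventory")  # assumed path
    .outputMode("append")
    .toTable("bronze.inventory_raw")
)
```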

Silver layer: In the silver layer, we transform and normalize data to fit a unified schema necessary for analytical processing. This layer refines the raw data into a structured format, closely mirroring the original EDW’s setup wherever possible. We only introduce changes to the data model in this layer if replicating original functionalities in the new environment is unfeasible. 
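
As a simplified illustration of a Silver transformation, the job below parses the raw JSON payload against an assumed schema, deduplicates replayed messages, and applies a basic sanity rule. The table names and fields are assumptions made for this sketch, not the client’s actual data model.

```python
# Minimal sketch: normalize Bronze inventory records into a structured Silver table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Assumed structure of the inventory payload.
inventory_schema = StructType([
    StructField("store_id", StringType()),
    StructField("sku", StringType()),
    StructField("quantity_on_hand", IntegerType()),
    StructField("updated_at", TimestampType()),
])

silver_inventory = (
    spark.table("bronze.inventory_raw")
    .withColumn("event", from_json(col("raw_json"), inventory_schema))
    .select("event.*", "ingested_at")
    .dropDuplicates(["store_id", "sku", "updated_at"])  # drop replayed messages
    .filter(col("quantity_on_hand") >= 0)               # basic sanity rule
)

silver_inventory.write.format("delta").mode("overwrite").saveAsTable("silver.inventory")
```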

Gold layer: The gold layer is where data gets aggregated and summarized, making it ready for business intelligence and decision-making. This layer generally stores data categorized by dimensions like store locations, product types, and time periods, emphasizing summary data over transactional details. 
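
For example, a Gold summary of daily sales by store and product category might be built along the following lines. The source table and column names are illustrative assumptions.

```python
# Minimal sketch: aggregate Silver sales into a Gold summary table by store, category, and day.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct, sum as sum_, to_date

spark = SparkSession.builder.getOrCreate()

daily_store_sales = (
    spark.table("silver.sales_transactions")  # assumed Silver table
    .withColumn("sale_date", to_date(col("sale_timestamp")))
    .groupBy("sale_date", "store_id", "product_category")
    .agg(
        sum_("net_amount").alias("total_sales"),
        countDistinct("transaction_id").alias("transaction_count"),
    )
)

daily_store_sales.write.format("delta").mode("overwrite").saveAsTable("gold.daily_store_sales")
```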

Data Workflow Orchestration with Azure Data Factory: We use Azure Data Factory to orchestrate the ETL processes, ensuring that data from every source is consistently extracted, transformed, and loaded into Azure Databricks. This keeps data flowing in a structured way through the Bronze, Silver, and Gold layers for effective management and analysis.

Business Intelligence and reporting: We migrated the original IBM Cognos virtual machines to the cloud, maintaining their existing functionalities. Tableau Bridge remains unchanged. Upon request from several departments, we have also added Google’s Looker. 

Data consumption and export: The system also supports data exports and consumption through secure protocols like SFTP, efficiently catering to external stakeholders and systems.

Spark in Azure Databricks: We leverage Spark on Azure Databricks to rapidly process large datasets, enabling real-time analytics and complex data transformations that are essential for providing immediate insights from the client’s sales data. 
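
As a rough sketch of what this looks like in practice, the Structured Streaming job below maintains hourly sales totals per store as new transactions arrive. The source and target table names are assumptions for illustration only.

```python
# Minimal sketch: keep an hourly-sales-per-store table up to date with Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_, window

spark = SparkSession.builder.getOrCreate()

hourly_sales = (
    spark.readStream.table("silver.sales_transactions")  # assumed streaming source (Delta)
    .groupBy(window(col("sale_timestamp"), "1 hour"), col("store_id"))
    .agg(sum_("net_amount").alias("hourly_sales"))
)

# "complete" mode rewrites the full aggregation result on every micro-batch.
(
    hourly_sales.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/gold/_checkpoints/hourly_sales")  # assumed path
    .outputMode("complete")
    .toTable("gold.hourly_store_sales")
)
```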

Data validation and quality checks: We developed a custom framework that enabled us to integrate data checkers within the pipeline. These checkers validate data at every stage of its transformation, ensuring accuracy and consistency throughout the data lifecycle. 
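
The framework itself is custom-built for the client, but the core idea can be sketched as small checker functions invoked between layers. The specific checks and table names below are illustrative.

```python
# Minimal sketch: lightweight data checkers invoked between pipeline stages.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()


def check_not_empty(df: DataFrame, name: str) -> None:
    """Fail the stage if the dataset is unexpectedly empty."""
    if df.limit(1).count() == 0:
        raise ValueError(f"Data check failed: {name} is empty")


def check_no_nulls(df: DataFrame, columns: list, name: str) -> None:
    """Fail the stage if any key column contains NULL values."""
    for column in columns:
        if df.filter(col(column).isNull()).limit(1).count() > 0:
            raise ValueError(f"Data check failed: {name}.{column} contains NULLs")


# Example usage after the Silver transformation step.
silver_inventory = spark.table("silver.inventory")
check_not_empty(silver_inventory, "silver.inventory")
check_no_nulls(silver_inventory, ["store_id", "sku"], "silver.inventory")
```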

Azure DevOps and CI/CD: We use Azure DevOps to automate our CI/CD pipeline, streamlining builds, tests, and deployments. 

What major challenges did Azure Databricks resolve?

The migration to the new architecture in Azure Databricks solved two critical business problems that were limiting the client’s ability to meet the business analytics needs of both internal and external customers. 

Fast and reliable implementation of new tools

Problem: Previously, incorporating modern and advanced tools into the client’s old data system was labor-intensive and required custom development solutions. This process was not only time-consuming but also often resulted in unstable implementations. The organizational structure, including multiple internal stakeholders like regional managers and store leaders, as well as external partners, frequently demanded new technological solutions, causing significant delays in deployment and adoption.  

Solution: The transition to Azure Databricks has enabled a more agile infrastructure, simplifying the integration of cutting-edge tools and solutions. Azure’s robust and flexible environment allows for the seamless addition of new analytics tools, including AI-driven applications, without the extensive customization previously required.  

Impact: The time to integrate new analytical tools has been reduced from several months to a few weeks, with stability and performance greatly enhanced. 

Securing access to historical data for future analysis 

Problem: Previously, if the business required a new report dependent on raw data not previously extracted, creating a new data structure and modifying the ETL process was necessary. However, when historical data was needed for trend analysis and was no longer available in the OLTP or other source systems, retrieving such data was simply not possible.  

Solution: With the implementation of the Bronze, Silver, and Gold data layers in Azure Databricks, raw data is fully extracted and stored in the Bronze layer with minimal transformations, preserving its original form. This strategy ensures that data not immediately needed for current analytics, but potentially useful in the future, is retained.

Impact: If a future business requirement calls for analyzing raw data from previous years that was never included in the Silver or Gold layers, the necessary raw data is already accessible in the Bronze layer, and integrating it into the analytical models can be accomplished in minimal time.

Why migrate from on-prem EDW to Azure Databricks?

After discussing how we addressed two specific challenges for our client, this chapter outlines the broader benefits of migrating from an on-premises SQL Server EDW to Azure Databricks. The comparison table below highlights key improvements and advantages, enhancing our client’s data management capabilities. 

| Criteria | SQL Server DW | Azure Databricks |
| --- | --- | --- |
| Scalability | Limited by physical server capacity; requires manual scaling. | Scales resources dynamically and automatically, with no physical limits. |
| Data Processing Speed | Slower batch processing; delays in data availability. | Real-time data processing capabilities, reducing latency. |
| Analytics & AI | Limited built-in support for advanced analytics and machine learning. | Extensive integration with AI frameworks, enhancing analytics capabilities. |
| Cost Efficiency | Higher costs due to physical infrastructure and maintenance. | Lower total cost of ownership; pay only for what you use. |
| Collaboration | Separate tools needed for collaboration; less integrated environment. | Unified workspace that supports multi-language and collaborative projects. |
| Data Security | Basic compliance and security features. | Advanced security features and compliance with global regulations. |
| Management | Requires more hands-on management and maintenance. | Simplified management via Azure Data Factory integration and automated workflows. |
| Resource Utilization | Resources often under- or over-utilized, leading to inefficiency. | Efficient resource management tailored to real-time needs. |

Partner with ABCloudz for your migration needs

The complexity and scale of data migration require deep expertise and a robust technological toolkit. At ABCloudz, we bring over 15 years of experience transforming traditional data systems into modern, scalable, cloud-based solutions. We are the only company to have sold our migration products to Microsoft and Amazon.

If your organization faces similar data infrastructure challenges, contact us to see how our expertise can meet your specific needs.

Trust the experts who have successfully completed hundreds of migrations. 

Ready to start the conversation?