MIGRATING LEGACY DATA WAREHOUSES TO CLOUD-BASED DATA LAKES: A PRACTICAL APPROACH WITH REAL-WORLD LESSONS

Authors

  • Vishnu Vardhan Reddy Chilukoori, Amazon.com Services LLC, USA
  • Srikanth Gangarapu, AT&T Services Inc, USA
  • Abhishek Vajpayee, Metropolis Technologies, USA

Keywords

Data Warehouse Migration, Cloud-Based Data Lakes, ETL Process Adaptation, Performance Optimization, Big Data Technologies

Abstract

This article examines the migration from legacy data warehouses to cloud-based data lakes, driven by the need for scalable and flexible data management in the face of exponential data growth. It covers the main stages of the migration: assessment and planning, data modeling considerations, ETL process adaptation, technical implementation, and performance optimization, along with common challenges and their solutions. The article offers practical guidance, best practices, and real-world insights to help organizations manage this complex transition. A case study of a financial services company migrating from a SAS environment to a Hadoop/Spark ecosystem illustrates the challenges and benefits of such a migration and provides lessons for organizations undertaking similar projects.

Published

2024-08-02

How to Cite

Vishnu Vardhan Reddy Chilukoori, Srikanth Gangarapu, & Abhishek Vajpayee. (2024). MIGRATING LEGACY DATA WAREHOUSES TO CLOUD-BASED DATA LAKES: A PRACTICAL APPROACH WITH REAL-WORLD LESSONS. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING AND TECHNOLOGY (IJCET), 15(4), 238-250. https://lib-index.com/index.php/IJCET/article/view/IJCET_15_04_020