EXPLORING DATA LAKES: A CORNERSTONE OF BIG DATA ENGINEERING

Authors

  • Vishnu Vardhan Amdiyala, Binghamton University, USA

Keywords:

Data Lakes, Big Data Engineering, Machine Learning, Advanced Analytics

Abstract

The massive growth in data volume has made robust tools for managing and analyzing data a necessity. Big data engineering has emerged as an important field in response to the difficulties of handling very large datasets, and data lakes are one of its main products. This article discusses the benefits that data lakes offer over traditional data warehouses and how they support advanced analytics and machine learning applications. As a central repository for different kinds of data, data lakes help businesses extract useful information from their data and promote innovation in many areas.

Published

2024-05-29

How to Cite

Vishnu Vardhan Amdiyala. (2024). EXPLORING DATA LAKES: A CORNERSTONE OF BIG DATA ENGINEERING. INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND TECHNOLOGY (IJARET), 15(3), 211-220. https://lib-index.com/index.php/IJARET/article/view/IJARET_15_03_018