Course Overview

Course Coverage

This course covers techniques needed to collect, store, analyze, and visualize big data, particularly for machine learning applications at scale. The MapReduce paradigm will be taught using the popular Hadoop framework. Both batch and real-time analysis of massive quantities of data will be applied to machine learning problems such as clustering, regression, and classification. Relational SQL and NoSQL database models will be discussed and compared through use-case analysis. Natural language processing will be studied as a comprehensive big data application.
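To give a flavor of the MapReduce paradigm mentioned above, here is a minimal word-count sketch in plain Python that mimics the three phases a framework like Hadoop runs for you (map, shuffle, reduce). The function names and sample documents are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would
    # before handing them to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values for one key; here, sum the counts.
    return (key, sum(values))

# Illustrative input "splits"; in Hadoop these would come from HDFS blocks.
documents = ["big data at scale", "big data in the cloud"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])  # → 2
```

In a real cluster, the map and reduce functions run in parallel on many machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.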

Student Learning Outcomes

By the end of the course, successful students will be able to:

  • (Concepts) Design distributed systems that manage “big data” using Hadoop and related technologies in the Hadoop ecosystem,
  • (Scaling) Use HDFS and MapReduce to store and analyze data at scale,
  • (Programming) Write Pig and Spark scripts to express more complex, programmable data flows on a distributed cluster,
  • (Storing) Analyze relational data with Hive and MySQL, analyze non-relational data with HBase and MongoDB, and choose an appropriate data storage technology for a given application,
  • (Modeling) Deploy machine learning models at scale over distributed big data using Spark MLlib,
  • (Managing) Understand how Hadoop clusters are managed by cluster and workflow managers such as YARN, Mesos, ZooKeeper, and Oozie,
  • (Streaming) Publish data to computing clusters using Kafka and Flume, and analyze streaming data with Spark Streaming.