HDP Analyst Data Science Training


  • Category: Hortonworks

Course Price: $1999 per participant

Course Description

This course teaches the principles and methods of data science, with a focus on machine learning and natural language processing.

It covers a range of tools and programming languages, including Python, IPython, NumPy, SciPy, pandas, scikit-learn, Pig, and Mahout, as well as the Natural Language Toolkit (NLTK) and Spark MLlib.

What you will learn

  • Describe the Hadoop and YARN architecture
  • Describe the differences between supervised and unsupervised learning
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Describe the data science life cycle
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script
  • Describe options for running Python code on a Hadoop cluster
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Use machine learning algorithms
  • Describe use cases for Natural Language Processing (NLP)
  • Use the Natural Language Toolkit (NLTK)
  • Describe the components of a Spark application
  • Write a Spark application in Python
  • Run machine learning algorithms using Spark MLlib
  • Take data science into production


Prerequisites

  • Students must be familiar with at least one programming or scripting language, with statistics and mathematics, and with the fundamentals of Hadoop. Attending the HDP Overview course beforehand is also suggested.

Who should attend this course?

  • Architects, analysts, software developers, and data scientists who need to apply data science and machine learning on Hadoop.


For dates, times, and location customization of this course, get in touch with us.

You can also speak with a learning consultant by calling 800-961-0337.


  • a. Setting Up a Development Environment
  • Demo: Block Storage
  • b. Using HDFS Commands
  • Demo: MapReduce
  • c. Using Apache Mahout for Machine Learning
  • Demo: Apache Pig
  • d. Getting Started with Apache Pig
  • e. Exploring Data with Pig
  • f. Using the IPython Notebook
  • Demo: The NumPy Package
  • Demo: The pandas Library
  • g. Data Analysis with Python
  • h. Interpolating Data Points
  • i. Defining a Pig UDF in Python
  • j. Streaming Python with Pig
  • Demo: Classification with Scikit-Learn
  • k. Computing K-Nearest Neighbor
  • l. Generating a K-Means Clustering
  • m. POS Tagging Using a Decision Tree
  • n. Using NLTK for Natural Language Processing
  • o. Classifying Text using Naive Bayes
  • p. Using Spark Transformations and Actions
  • q. Using Spark MLlib
  • r. Creating a Spam Classifier with MLlib

Course Details

  • Enrolled: 1246
  • Duration: 3 Days