ANALYTICS PATH.

Module: Introduction to Python

In this section, you’ll learn all about the basics of the Python programming language: defining variables; working with lists, tuples, sets and dictionaries; and more. Click here to read more.
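
A minimal sketch of those core structures (the values here are arbitrary examples, not from the course material):

```python
# A quick tour of Python's core data structures.
language = "Python"                 # variable assignment
scores = [85, 92, 78]               # list: ordered, mutable
point = (3, 4)                      # tuple: ordered, immutable
tags = {"data", "python"}           # set: unique members, unordered
ages = {"Alice": 30, "Bob": 25}     # dictionary: key -> value

scores.append(88)                   # lists grow in place
tags.add("analytics")               # sets accept new members...
tags.add("python")                  # ...but silently ignore duplicates

print(scores)        # [85, 92, 78, 88]
print(ages["Alice"]) # 30
```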

In this section, you’ll learn all about the basics of file handling in Python and conduct some analysis on text and CSV files. Click here to read more.
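
A small illustration of the kind of file and CSV handling covered (the filename and figures are invented), using only the standard library:

```python
import csv
import os
import tempfile

# Write a small CSV file, then read it back and total one column.
path = os.path.join(tempfile.gettempdir(), "sales_demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product", "revenue"])
    writer.writerows([["widget", 120], ["gadget", 80]])

total = 0
with open(path) as f:
    for row in csv.DictReader(f):   # each row becomes a dict keyed by header
        total += int(row["revenue"])

print(total)  # 200
```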

In this section, you’ll learn about NumPy (Numerical Python) – including working with arrays, matrices and vectors, and calculating the dot product. Click here to read more.
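
A short sketch of the NumPy operations mentioned, on made-up values:

```python
import numpy as np

vector = np.array([1, 2, 3])
matrix = np.array([[1, 0],
                   [0, 1],
                   [1, 1]])

dot = np.dot(vector, vector)   # 1*1 + 2*2 + 3*3 = 14
product = vector @ matrix      # vector-matrix product -> [4, 5]

print(dot)      # 14
print(product)  # [4 5]
```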

In this section, you’ll learn about Pandas – a super flexible tool we can use for data analysis in Python. We’ll cover many dataframe and series functions. Click here to read more.
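
A toy example of the sort of dataframe and series functions covered (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "Paris", "London", "Berlin"],
    "sales": [250, 180, 320, 150],
})

print(df.head())                          # peek at the first rows
print(df["sales"].mean())                 # series aggregate: 225.0
by_city = df.groupby("city")["sales"].sum()
print(by_city["London"])                  # 250 + 320 = 570
```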

Module: Some Sample Python Use Cases

This script aims to normalise domain names – that is, to drop the subdomain from a list of domains, ‘cleaning’ them so that usage across a whole domain can be analysed much more easily. This dataset is from https://www.domcop.com/top-10-million-domains Click here to read more.
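
The article's actual script isn't reproduced here, but a naive sketch of subdomain-dropping might look like the following. Note that simply keeping the last two dot-separated labels breaks on multi-part TLDs such as .co.uk, where a public-suffix-aware library like tldextract would be needed:

```python
def normalise_domain(domain):
    """Naively keep only the last two labels of a domain name."""
    labels = domain.lower().strip().split(".")
    return ".".join(labels[-2:])

domains = ["blog.example.com", "www.example.com", "example.com"]
print([normalise_domain(d) for d in domains])
# all three collapse to 'example.com'
```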

Sentiment analysis can provide key insight into the feelings of your customers towards your company & hence is becoming an increasingly important part of data analysis. Building a machine learning model to identify positive and negative sentiment is pretty complex, but luckily for us, there is a Python library that can help us out. It’s called TextBlob. Through this post, we’ll look at how we use TextBlob with Python & the CSV functionality & also with Pandas, using dataframes. Click here to read more.

Using customer usage logs, I need to identify the customers that travel most each day and understand the most popular routes across the world.  Click here to read more.

Machine Learning is a growing field & it seems that almost every company is deploying machine learning algorithms to solve problems. The question is: is machine learning actually necessary, or is it just so companies can say ‘I’m doing that too’? Click here to read more.

The below script shows how we may handle RFM segmentation with Python. RFM stands for Recency, Frequency and Monetary: Recency is how many days since the customer last purchased; Frequency is the total purchase count by customer; Monetary is the total spend by customer. Click here to read more.
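
A compact sketch of those three calculations with Pandas, using an invented orders table and snapshot date:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-05", "2021-01-20",
                            "2021-01-10", "2021-01-12", "2021-01-25"]),
    "amount": [50.0, 30.0, 20.0, 40.0, 10.0],
})
snapshot = pd.Timestamp("2021-02-01")  # 'today' for the recency calculation

rfm = orders.groupby("customer").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),  # days since last order
    frequency=("date", "count"),                            # number of orders
    monetary=("amount", "sum"),                             # total spend
)
print(rfm)
```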

Module: Machine Learning Overview

Machine learning uses statistical techniques to give computer systems the ability to ‘learn’ rather than being explicitly programmed. By learning from historical inputs, we’re able to achieve far greater accuracy in our predictions & constantly refine the model with new data. Click here to read more.

Supervised learning is where we provide the model with the actual outputs from the data. This lets it build a picture of the data and form links between the historic parameters (or features) that have influenced the output. To put a formula on supervised learning, it would be Y = f(X), where Y is the predicted output produced by the model and X is the input data. So, by executing a function against X, we can predict Y. Click here to read more.

A decision tree builds a model in the form of a tree structure – almost like a flow chart. To calculate the expected outcome, it uses decision points and, based on the results of those decisions, buckets each input. In this article, we’ll talk about classification and regression decision trees, along with random forests. Click here to read more.
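
A minimal classification-tree example with scikit-learn (the study/sleep data is invented, not from the article):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: [hours_studied, hours_slept] -> pass (1) / fail (0).
X = [[1, 4], [2, 5], [8, 7], [9, 8], [1, 8], [9, 4]]
y = [0, 0, 1, 1, 0, 1]

# A shallow tree: each decision point buckets the inputs further.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

pred = tree.predict([[8, 6], [1, 5]])  # expect a pass, then a fail
print(pred)
```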

In the table above, A is the constant (the Y intercept), also known as B0, and the X multiplier (the coefficient on X) is known as B1. So, in our equation Y = B0 + B1(X), we can substitute B0 and B1 for the values in the coefficient column of the table. Click here to read more.

Module: Machine Learning Practicals

In this section, we cover data standardisation and encoding, designed to get data ready for machine learning models and eliminate as much bias as possible. Click here to read more.
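
A short sketch of both steps – standardising a numeric column and one-hot encoding a categorical one – using scikit-learn and Pandas on invented data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [30000, 50000, 70000],
    "city": ["London", "Paris", "London"],
})

# Standardise the numeric column: mean 0, unit variance.
scaler = StandardScaler()
df["income_std"] = scaler.fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical column so models see numbers, not strings.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```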

In this section, we will implement one of the most common machine learning models – linear regression.  Click here to read more.
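
As a minimal sketch (the data is generated here, not the article's), fitting a line Y = B0 + B1(X) with scikit-learn and reading back the intercept and coefficient:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Data generated from Y = 3 + 2X exactly, so the model should recover
# B0 (intercept) = 3 and B1 (coefficient) = 2.
X = np.array([[1], [2], [3], [4], [5]])
y = 3 + 2 * X.ravel()

model = LinearRegression().fit(X, y)
print(model.intercept_)          # ~3.0
print(model.coef_[0])            # ~2.0
print(model.predict([[10]])[0])  # ~23.0
```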

In this section, we will implement one of the unsupervised learning models, K-Means clustering. This is a really useful model & the code sample will give you a good starting point.  Click here to read more.
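
A tiny K-Means sketch on two invented blobs of points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs: one near (0, 0), one near (10, 10).
X = np.array([[0, 0], [0.5, 0.2], [0.2, 0.4],
              [10, 10], [10.5, 9.8], [9.8, 10.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Points in the same blob should share a cluster label.
print(labels)
```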

In this section, we will implement the K-Nearest Neighbors algorithm.  Click here to read more.
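
A minimal K-Nearest Neighbors sketch on invented one-dimensional data:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data: small values labelled 0, large values labelled 1.
X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

# Each prediction is a vote among the 3 nearest training points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[2.5], [10.5]])
print(pred)  # expect [0 1]
```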

This article will take you through a practical implementation where, based on historic data, we aim to predict future weather. The data for this model is continuous & hence requires a regression model, rather than a discrete classification model. Remember, random forests are supervised machine learning models – supervised because we give them the ‘truth’ in order to help them learn. Click here to read more.
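
A hedged sketch of a random forest regressor on synthetic, weather-style data (the article's real dataset is not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented continuous data: yesterday's temperature -> today's, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(200, 1))
y = X.ravel() + rng.normal(0, 1, size=200)

# A regressor (not a classifier), because the target is continuous.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict([[20.0]])[0]
print(round(pred, 1))  # should land near 20
```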

When completing my domain normalisation project, I used Spark to do the heavy lifting – getting data into a dataframe & aggregating (group by and sum) – and then used Pandas for the domain manipulation. Finally, I converted my Pandas dataframe back to Spark to write it to HDFS. Click here to read more.

In this section, we will find out how to validate the accuracy of our models. Additionally, we will talk about how we can tune our model to make it more accurate.  Click here to read more.
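
One common way to do both – not necessarily the article's – is cross-validation for accuracy and a grid search for tuning, sketched here with scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Validate: mean accuracy over 5 cross-validation folds.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean())

# Tune: search over n_neighbors for the best-scoring setting.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [1, 3, 5, 7]}, cv=5).fit(X, y)
print(grid.best_params_)
```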

Module: End to End Machine Learning Projects

This data set consists of 100 variables and approximately 100 thousand records. It contains variables describing attributes of the telecom industry and various factors considered important when dealing with its customers. The target variable here is churn, which indicates whether the customer will churn or not. We can use this data set to predict which customers would churn, depending on the various variables available. Click here to read more.