Agile Data Science – SparkML

The machine learning library of Spark, also called "SparkML" or "MLlib", consists of common learning algorithms, including classification, regression, clustering and collaborative filtering.

Why learn SparkML for Agile?

Spark is becoming the de-facto platform for building machine learning algorithms and applications. Developers work on Spark to implement machine learning algorithms in a scalable and concise manner within the Spark framework. With this framework, we will learn the concepts of machine learning, its utilities and its algorithms. Agile always opts for a framework that delivers short and quick results.

ML Algorithms

ML algorithms include common learning algorithms such as classification, regression, clustering and collaborative filtering.

Features

It includes feature extraction, transformation, dimensionality reduction and selection.

Pipelines

Pipelines provide tools for constructing, evaluating and tuning machine-learning pipelines.

Popular Algorithms

Following are a few popular algorithms −

Basic Statistics
Regression
Classification
Recommendation System
Clustering
Dimensionality Reduction
Feature Extraction
Optimization

Recommendation System

A recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Recommendation systems use various filtering approaches, as follows −

Collaborative Filtering

It involves building a model based on a user's past behavior as well as similar decisions made by other users. This filtering model is used to predict items that a user is likely to be interested in.

Content based Filtering

It involves filtering the discrete characteristics of an item in order to recommend new items with similar properties.

In our subsequent chapters, we will focus on the use of a recommendation system for solving a specific problem and improving prediction performance from the agile methodology point of view.
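As an illustration of collaborative filtering on Spark, the following is a minimal sketch using MLlib's ALS (Alternating Least Squares) implementation. The input file ratings.csv and the column names userId, movieId and rating are assumptions made for this example, not part of the tutorial's dataset.

# Minimal sketch of collaborative filtering with Spark MLlib's ALS.
# The input file "ratings.csv" and its column layout are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("RecommendationExample").getOrCreate()

# Ratings data with columns: userId, movieId, rating (assumed layout)
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Alternating Least Squares is MLlib's collaborative filtering algorithm
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train)

# Evaluate the predicted ratings with RMSE
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))

# Produce top-5 item recommendations for every user
model.recommendForAllUsers(5).show(truncate=False)
spark.stop()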

Building a Regression Model

Logistic regression refers to the machine learning algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable, coded as 1 (true/yes) or 0 (false/no).

In this chapter, we will focus on developing such a regression model in Python. The example focuses on exploring data from a CSV file. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

data = pd.read_csv("bank.csv", header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))

Follow these steps to implement the above code in Anaconda Navigator with "Jupyter Notebook" −

Step 1 − Launch the Jupyter Notebook with Anaconda Navigator.

Step 2 − Upload the CSV file to get the output of the regression model in a systematic manner.

Step 3 − Create a new notebook and execute the above-mentioned code to get the desired output.
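A minimal continuation of the above script could encode the categorical columns, split the data and fit the logistic regression classifier. The sketch below assumes the bank-marketing dataset has a label column named "y" holding yes/no values; adjust the names to match your CSV file.

# Continuation sketch (assumes a "y" label column with yes/no values)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_csv("bank.csv", header=0).dropna()

# One-hot encode the categorical predictors; map the label to 0/1
X = pd.get_dummies(data.drop("y", axis=1))
y = (data["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))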

Extracting features with PySpark

In this chapter, we will learn about the application of extracting features with PySpark in Agile Data Science.

Overview of Spark

Apache Spark can be defined as a fast real-time processing framework. It does computations to analyze data in real time. Apache Spark is introduced as a stream processing system that works in real time and can also take care of batch processing. Apache Spark supports interactive queries and iterative algorithms.

Spark is written in the Scala programming language. PySpark can be considered a combination of Python with Spark. PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data scientists use PySpark for tracking features, as discussed in the previous chapter.

In this example, we will focus on the transformations needed to build a dataset called counts and save it to a particular file.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
   .map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Using PySpark, a user can work with RDDs in the Python programming language. The inbuilt Py4J library, which links the Python interpreter to the Spark core, helps in this.
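Beyond the word-count transformation, feature extraction itself is typically done with the pyspark.ml.feature package. The following is a minimal sketch that turns a couple of sample sentences into TF-IDF feature vectors; the sample data is illustrative.

# Minimal sketch of feature extraction with pyspark.ml.feature
# (the sample sentences are illustrative data).
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("FeatureExtraction").getOrCreate()

docs = spark.createDataFrame([
    (0, "agile data science delivers results quickly"),
    (1, "spark makes feature extraction scalable"),
], ["id", "text"])

# Split each document into words
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)

# Hash the words into term-frequency vectors
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1024).transform(words)

# Re-weight term frequencies by inverse document frequency
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
features = idf_model.transform(tf)

features.select("id", "features").show(truncate=False)
spark.stop()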

Agile Tools & Installation

In this chapter, we will learn about the different agile tools and their installation. The development stack of the agile methodology includes the following set of components −

Events

An event is an occurrence that happens or is logged along with its features and a timestamp. An event can come in many forms − from servers, sensors, financial transactions, or actions that our users take in our application. In this complete tutorial, we will use JSON files, which facilitate data exchange among different tools and languages.

Collectors

Collectors are event aggregators. They collect events in a systematic manner, storing and aggregating bulky data and queuing it for action by real-time workers.

Distributed document store

A distributed document store runs across multiple nodes and stores documents in a specific format. We will focus on MongoDB in this tutorial.

Web application server

A web application server serves data as JSON to the client for visualization, with minimal overhead. This means the web application server also helps to test and deploy the projects created with the agile methodology.

Modern Browser

A modern browser or application presents the data as an interactive tool for our users.

Local Environmental Setup

For managing datasets, we will focus on the Anaconda distribution of Python, which includes tools for managing Excel, CSV and many other file formats. The dashboard of the Anaconda framework, once installed, is shown below. It is also called the "Anaconda Navigator" −

The navigator includes the "Jupyter framework", which is a notebook system that helps to manage datasets. Once you launch the framework, it will be hosted in the browser as shown below −
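To connect the pieces above, here is a minimal sketch of an event expressed as JSON and collected into MongoDB; the database name, collection name and event fields are illustrative assumptions.

# Minimal sketch: a JSON event stored in MongoDB (names and fields are illustrative)
import json
from datetime import datetime, timezone
from pymongo import MongoClient

# An event with its features and a timestamp
event = {
    "type": "page_view",
    "user_id": 42,
    "page": "/pricing",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Events can be exchanged between tools as JSON text
print(json.dumps(event))

# ...and collected into a distributed document store such as MongoDB
client = MongoClient("mongodb://localhost:27017/")
client["agile_ds"]["events"].insert_one(event)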

Fixing Prediction Problem

In this chapter, we will focus on fixing a prediction problem with the help of a specific scenario.

Consider a company that wants to automate its loan eligibility checks based on the customer details provided through an online application form. The details include the name of the customer, gender, marital status, loan amount and other mandatory details. The details are recorded in the CSV file as shown below −

Execute the following code to evaluate the prediction problem −

import pandas as pd
from sklearn import ensemble
import numpy as np
from scipy.stats import mode
from sklearn import preprocessing, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# loading the dataset
data = pd.read_csv("train.csv", index_col="Loan_ID")

def num_missing(x):
   return sum(x.isnull())

# imputing the missing categorical values with the mode
data["Gender"].fillna(mode(list(data["Gender"])).mode[0], inplace=True)
data["Married"].fillna(mode(list(data["Married"])).mode[0], inplace=True)
data["Self_Employed"].fillna(mode(list(data["Self_Employed"])).mode[0], inplace=True)

# imputing the mean for the missing loan amount
data["LoanAmount"].fillna(data["LoanAmount"].mean(), inplace=True)

mapping = {"0": 0, "1": 1, "2": 2, "3+": 3}
data = data.replace({"Dependents": mapping})
data["Dependents"].fillna(data["Dependents"].mean(), inplace=True)
data["Loan_Amount_Term"].fillna(method="ffill", inplace=True)
data["Credit_History"].fillna(method="ffill", inplace=True)
print(data.apply(num_missing, axis=0))

# converting the categorical data to numbers using the label encoder
var_mod = ["Gender", "Married", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
le = LabelEncoder()
for i in var_mod:
   le.fit(list(data[i].values))
   data[i] = le.transform(list(data[i]))

# Train test split
x = ["Gender", "Married", "Education", "Self_Employed", "Property_Area", "LoanAmount",
   "Loan_Amount_Term", "Credit_History", "Dependents"]
y = ["Loan_Status"]
print(data[x])
X_train, X_test, y_train, y_test = model_selection.train_test_split(data[x], data[y],
   test_size=0.2)

# Random forest classifier
# clf = ensemble.RandomForestClassifier(n_estimators=100, criterion="gini",
#    max_depth=3, max_features="auto", n_jobs=-1)
clf = ensemble.RandomForestClassifier(n_estimators=200, max_features=3,
   min_samples_split=5, oob_score=True, n_jobs=-1, criterion="entropy")
clf.fit(X_train, y_train.values.ravel())
accuracy = clf.score(X_test, y_test)
print(accuracy)

Output

The above code generates the following output.
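Beyond the accuracy printed above, a short follow-up sketch such as the one below can show which attributes drive the eligibility prediction. It simply reuses the clf, x and X_test names defined in the script above.

# Follow-up sketch: inspect feature importances and score a few held-out applicants.
# Reuses clf, x and X_test from the script above.
import pandas as pd

importances = pd.Series(clf.feature_importances_, index=x).sort_values(ascending=False)
print("Feature importances:")
print(importances)

# The out-of-bag estimate gives a quick generalization check alongside the test score
print("OOB score:", clf.oob_score_)

# Predict eligibility for the first few held-out applicants
print(clf.predict(X_test.head()))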

Data Enrichment

Data enrichment refers to a range of processes used to enhance, refine and improve raw data. It refers to useful data transformation (raw data to useful information). The process of data enrichment focusses on making data a valuable asset for a modern business or enterprise.

The most common data enrichment process includes the correction of spelling mistakes or typographical errors in a database through the use of specific decision algorithms. Data enrichment tools add useful information to simple data tables.

Consider the following code for spell correction of words −

import re
from collections import Counter

def words(text):
   return re.findall(r"\w+", text.lower())

WORDS = Counter(words(open("big.txt").read()))

def P(word, N=sum(WORDS.values())):
   "Probability of `word`."
   return WORDS[word] / N

def correction(word):
   "Most probable spelling correction for `word`."
   return max(candidates(word), key=P)

def candidates(word):
   "Generate possible spelling corrections for `word`."
   return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
   "The subset of `words` that appear in the dictionary of WORDS."
   return set(w for w in words if w in WORDS)

def edits1(word):
   "All edits that are one edit away from `word`."
   letters = "abcdefghijklmnopqrstuvwxyz"
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [L + R[1:] for L, R in splits if R]
   transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
   replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
   inserts = [L + c + R for L, R in splits for c in letters]
   return set(deletes + transposes + replaces + inserts)

def edits2(word):
   "All edits that are two edits away from `word`."
   return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction("speling"))
print(correction("korrectud"))

In this program, candidate corrections are matched against "big.txt", a large text file of correctly spelled words. Candidates that appear in this file are treated as known words, and the most probable one is printed as the correction.

Output

The above code will generate the following output −
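Since data enrichment usually targets tables rather than single words, the following sketch applies the correction() function defined above to a text column of a small pandas DataFrame; the column name and sample rows are illustrative.

# Enrichment sketch: add a corrected-text column to a small table.
# Assumes correction() from the program above is already defined;
# the DataFrame contents are illustrative.
import pandas as pd

df = pd.DataFrame({"comment": ["speling is hard", "korrectud entry"]})

def correct_text(text):
   # correct each word independently and rejoin the sentence
   return " ".join(correction(w) for w in text.split())

df["comment_corrected"] = df["comment"].apply(correct_text)
print(df)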

Agile Data Science – Home

Agile is a software development methodology that helps in building software through incremental sessions using short iterations of 1 to 4 weeks, so that the development stays aligned with changing business needs. Agile Data Science combines the agile methodology with data science. In this tutorial, we have used appropriate examples to help you understand agile development and data science in a general and quick way.

Audience

This tutorial has been prepared for developers and project managers to help them understand the basics of agile principles and their implementation. After completing this tutorial, you will find yourself at a moderate level of expertise, from where you can advance further with the implementation of data science and agile methodology.

Prerequisites

It is important to have basic knowledge of data science modules and software development concepts such as software requirements, coding and testing.

Improving Prediction Performance

In this chapter, we will focus on building a model that helps in predicting students' performance, with a number of attributes included in it. The focus is to display the failure result of students in an examination.

Process

The target value of the assessment is G3. This value can be binned and further classified as failure or success. If the G3 value is greater than or equal to 10, the student passes the examination.

Example

Consider the following example wherein a code is executed to predict the performance of students −

import numpy as np
import pandas as pd

""" Read data file as DataFrame """
df = pd.read_csv("student-mat.csv", sep=";")

""" Import ML helpers """
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC   # Support Vector Machine Classifier model

""" Split Data into Training and Testing Sets """
def split_data(X, Y):
   return train_test_split(X, Y, test_size=0.2, random_state=17)

""" Confusion Matrix """
def confuse(y_true, y_pred):
   cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
   # print("\nConfusion Matrix:\n", cm)
   fpr(cm)
   ffr(cm)

""" False Pass Rate """
def fpr(confusion_matrix):
   fp = confusion_matrix[0][1]
   tf = confusion_matrix[0][0]
   rate = float(fp) / (fp + tf)
   print("False Pass Rate: ", rate)

""" False Fail Rate """
def ffr(confusion_matrix):
   ff = confusion_matrix[1][0]
   tp = confusion_matrix[1][1]
   rate = float(ff) / (ff + tp)
   print("False Fail Rate: ", rate)
   return rate

""" Train Model and Print Score """
def train_and_score(X, y):
   X_train, X_test, y_train, y_test = split_data(X, y)
   clf = Pipeline([
      ("reduce_dim", SelectKBest(chi2, k=2)),
      ("train", LinearSVC(C=100))
   ])
   scores = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=2)
   print("Mean Model Accuracy:", np.array(scores).mean())
   clf.fit(X_train, y_train)
   confuse(y_test, clf.predict(X_test))
   print()

""" Main Program """
def main():
   print("\nStudent Performance Prediction")

   # For each feature, encode to categorical values
   class_le = LabelEncoder()
   for column in df[["school", "sex", "address", "famsize", "Pstatus", "Mjob",
      "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities",
      "nursery", "higher", "internet", "romantic"]].columns:
      df[column] = class_le.fit_transform(df[column].values)

   # Encode G1, G2, G3 as pass or fail binary values
   for i, row in df.iterrows():
      if row["G1"] >= 10:
         df.at[i, "G1"] = 1
      else:
         df.at[i, "G1"] = 0

      if row["G2"] >= 10:
         df.at[i, "G2"] = 1
      else:
         df.at[i, "G2"] = 0

      if row["G3"] >= 10:
         df.at[i, "G3"] = 1
      else:
         df.at[i, "G3"] = 0

   # Target values are G3
   y = df.pop("G3")

   # Feature set is remaining features
   X = df

   print("\n\nModel Accuracy Knowing G1 & G2 Scores")
   print("=====================================")
   train_and_score(X, y)

   # Remove grade report 2
   X.drop(["G2"], axis=1, inplace=True)
   print("\n\nModel Accuracy Knowing Only G1 Score")
   print("=====================================")
   train_and_score(X, y)

   # Remove grade report 1
   X.drop(["G1"], axis=1, inplace=True)
   print("\n\nModel Accuracy Without Knowing Scores")
   print("=====================================")
   train_and_score(X, y)

main()

Output

The above code generates the output as shown below. The prediction is treated with reference to only one variable. With reference to one variable, the student performance prediction is as shown below −

Implementation of Agile

There are various methodologies used in the agile development process. These methodologies can be used for the data science research process as well. The flowchart given below shows the different methodologies −

Scrum

In software development terms, scrum means managing work with a small team and managing a specific project to reveal the strengths and weaknesses of the project.

Crystal methodologies

Crystal methodologies include innovative techniques for product management and execution. With this method, teams can go about similar tasks in different ways. The Crystal family is one of the easiest methodologies to apply.

Dynamic Systems Development Method

This delivery framework is primarily used to implement the current knowledge system in software methodology.

Feature Driven Development

The focus of this development life cycle is the features involved in the project. It works best for domain object modeling and code and feature development with ownership.

Lean Software Development

This method aims at increasing the speed of software development at low cost and focusses the team on delivering specific value to the customer.

Extreme Programming

Extreme programming is a unique software development methodology which focusses on improving software quality. It is effective when the customer is not sure about the functionality of the project.

Agile methodologies are taking root in the data science stream, and agile is considered an important software methodology. Using agile, self-organizing, cross-functional teams can work together in an effective manner. As mentioned, there are six main categories of agile development and each of them can be streamed with data science as per the requirements. Data science involves an iterative process for statistical insights. Agile helps in breaking down data science modules and helps in processing iterations and sprints in an effective manner.

The process of Agile Data Science is an amazing way of understanding how and why a data science module is implemented. It solves problems in a creative manner.

NoSQL & Dataflow programming

There are times when data is unavailable in relational format and we need to keep it transactional with the help of NoSQL databases.

In this chapter, we will focus on the dataflow of NoSQL. We will also learn how it operates with a combination of agile and data science.

One of the major reasons to use NoSQL with agile is to increase speed in the face of market competition. The following reasons show how NoSQL is a best fit for agile software methodology −

Fewer Barriers

Changing a model mid-stream has real costs, even in the case of agile development. With NoSQL, users work with aggregate data instead of wasting time normalizing data. The main point is to get something done and working, rather than spending effort on making a perfect data model.

Increased Scalability

Whenever an organization creates a product, it lays more focus on scalability. NoSQL is always known for its scalability, but it works better when designed for horizontal scalability.

Ability to leverage data

NoSQL uses a schema-less data model that allows the user to readily work with volumes of data that include several parameters of variability and velocity. When considering a choice of technology, you should always consider the one which leverages the data to a greater scale.

Dataflow of NoSQL

Let us consider the following example, wherein we show how a data model is focused on creating the RDBMS schema. Following are the different requirements of the schema −

User identification should be listed.
Every user should have at least one mandatory skill.
The details of every user's experience should be maintained properly.

The user table is normalized into 3 separate tables −

Users
User skills
User experience

The complexity of querying the database increases, and greater time consumption is noted with increased normalization, which is not good for the agile methodology. The same schema can be designed with a NoSQL database as mentioned below −

NoSQL maintains the structure in JSON format, which is light-weight in structure. With JSON, applications can store objects with nested data as single documents.
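To make the contrast concrete, the following sketch stores the same user, skill and experience information as a single nested document in MongoDB; the field names and sample values are illustrative assumptions.

# Sketch: the normalized Users / User skills / User experience tables
# collapsed into one nested document (field names and values are illustrative).
from pymongo import MongoClient

user_doc = {
    "user_id": "U1001",
    "name": "Asha",
    "skills": ["python", "spark", "mongodb"],          # at least one skill
    "experience": [
        {"company": "Acme Corp", "role": "Data Analyst", "years": 2},
        {"company": "Globex", "role": "Data Scientist", "years": 3},
    ],
}

client = MongoClient("mongodb://localhost:27017/")
users = client["agile_ds"]["users"]
users.insert_one(user_doc)

# A single query returns the user together with skills and experience,
# with no joins across normalized tables
print(users.find_one({"user_id": "U1001"}))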