Agile Data Science – SparkML

The machine learning library of Spark, also called "SparkML" or "MLlib", consists of common learning algorithms, including classification, regression, clustering and collaborative filtering.

Why learn SparkML for Agile?

Spark is becoming the de-facto platform for building machine learning algorithms and applications. Developers work on Spark to implement machine learning algorithms in a scalable and concise manner within the Spark framework. With this framework, we will learn the concepts of machine learning, its utilities and its algorithms. Agile always opts for a framework that delivers short and quick results.

ML Algorithms

ML algorithms include common learning algorithms such as classification, regression, clustering and collaborative filtering.

Features

It includes feature extraction, transformation, dimensionality reduction and selection.

Pipelines

Pipelines provide tools for constructing, evaluating and tuning machine-learning pipelines.

Popular Algorithms

Following are a few popular algorithms −

Basic Statistics
Regression
Classification
Recommendation System
Clustering
Dimensionality Reduction
Feature Extraction
Optimization

Recommendation System

A recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Recommendation systems use various filtering approaches, as follows −

Collaborative Filtering

It involves building a model based on a user's past behavior as well as similar decisions made by other users. This filtering model is used to predict items that a user is likely to be interested in.

Content based Filtering

It involves filtering the discrete characteristics of an item in order to recommend new items with similar properties.

In our subsequent chapters, we will focus on the use of a recommendation system for solving a specific problem and improving prediction performance from the agile methodology point of view.
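As an illustration of collaborative filtering on Spark, the following is a minimal sketch using MLlib's ALS (Alternating Least Squares) implementation. The input file ratings.csv and the column names userId, movieId and rating are assumptions made for this example, not part of the tutorial's dataset.

# Minimal sketch of collaborative filtering with Spark MLlib's ALS.
# The input file "ratings.csv" and its column layout are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("RecommendationExample").getOrCreate()

# Ratings data with columns: userId, movieId, rating (assumed layout)
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

# Alternating Least Squares is MLlib's collaborative filtering algorithm
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(train)

# Evaluate the predicted ratings with RMSE
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
print("RMSE:", evaluator.evaluate(predictions))

# Produce top-5 item recommendations for every user
model.recommendForAllUsers(5).show(truncate=False)
spark.stop()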

Building a Regression Model

Logistic regression refers to the machine learning algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable, coded as 1 (true/yes) or 0 (false/no).

In this chapter, we will focus on developing such a regression model in Python. The example focuses on exploring data from a CSV file. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

data = pd.read_csv("bank.csv", header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))

Follow these steps to implement the above code in Anaconda Navigator with "Jupyter Notebook" −

Step 1 − Launch the Jupyter Notebook with Anaconda Navigator.

Step 2 − Upload the CSV file to get the output of the regression model in a systematic manner.

Step 3 − Create a new notebook and execute the above-mentioned code to get the desired output.
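A minimal continuation of the above script could encode the categorical columns, split the data and fit the logistic regression classifier. The sketch below assumes the bank-marketing dataset has a label column named "y" holding yes/no values; adjust the names to match your CSV file.

# Continuation sketch (assumes a "y" label column with yes/no values)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = pd.read_csv("bank.csv", header=0).dropna()

# One-hot encode the categorical predictors; map the label to 0/1
X = pd.get_dummies(data.drop("y", axis=1))
y = (data["y"] == "yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))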

Extracting features with PySpark

In this chapter, we will learn about the application of extracting features with PySpark in Agile Data Science.

Overview of Spark

Apache Spark can be defined as a fast real-time processing framework. It does computations to analyze data in real time. Apache Spark is introduced as a stream processing system that works in real time and can also take care of batch processing. Apache Spark supports interactive queries and iterative algorithms.

Spark is written in the Scala programming language. PySpark can be considered a combination of Python with Spark. PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data scientists use PySpark for tracking features, as discussed in the previous chapter.

In this example, we will focus on the transformations needed to build a dataset called counts and save it to a particular file.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
   .map(lambda word: (word, 1)) \
   .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Using PySpark, a user can work with RDDs in the Python programming language. The inbuilt Py4J library, which links the Python interpreter to the Spark core, helps in this.
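Beyond the word-count transformation, feature extraction itself is typically done with the pyspark.ml.feature package. The following is a minimal sketch that turns a couple of sample sentences into TF-IDF feature vectors; the sample data is illustrative.

# Minimal sketch of feature extraction with pyspark.ml.feature
# (the sample sentences are illustrative data).
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("FeatureExtraction").getOrCreate()

docs = spark.createDataFrame([
    (0, "agile data science delivers results quickly"),
    (1, "spark makes feature extraction scalable"),
], ["id", "text"])

# Split each document into words
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)

# Hash the words into term-frequency vectors
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1024).transform(words)

# Re-weight term frequencies by inverse document frequency
idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
features = idf_model.transform(tf)

features.select("id", "features").show(truncate=False)
spark.stop()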

Agile Tools & Installation

In this chapter, we will learn about the different agile tools and their installation. The development stack of the agile methodology includes the following set of components −

Events

An event is an occurrence that happens or is logged along with its features and a timestamp. An event can come in many forms − from servers, sensors, financial transactions, or actions that our users take in our application. In this complete tutorial, we will use JSON files, which facilitate data exchange among different tools and languages.

Collectors

Collectors are event aggregators. They collect events in a systematic manner, storing and aggregating bulky data and queuing it for action by real-time workers.

Distributed document store

A distributed document store runs across multiple nodes and stores documents in a specific format. We will focus on MongoDB in this tutorial.

Web application server

A web application server serves data as JSON to the client for visualization, with minimal overhead. This means the web application server also helps to test and deploy the projects created with the agile methodology.

Modern Browser

A modern browser or application presents the data as an interactive tool for our users.

Local Environmental Setup

For managing datasets, we will focus on the Anaconda distribution of Python, which includes tools for managing Excel, CSV and many other file formats. The dashboard of the Anaconda framework, once installed, is shown below. It is also called the "Anaconda Navigator" −

The navigator includes the "Jupyter framework", which is a notebook system that helps to manage datasets. Once you launch the framework, it will be hosted in the browser as shown below −
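To connect the pieces above, here is a minimal sketch of an event expressed as JSON and collected into MongoDB; the database name, collection name and event fields are illustrative assumptions.

# Minimal sketch: a JSON event stored in MongoDB (names and fields are illustrative)
import json
from datetime import datetime, timezone
from pymongo import MongoClient

# An event with its features and a timestamp
event = {
    "type": "page_view",
    "user_id": 42,
    "page": "/pricing",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Events can be exchanged between tools as JSON text
print(json.dumps(event))

# ...and collected into a distributed document store such as MongoDB
client = MongoClient("mongodb://localhost:27017/")
client["agile_ds"]["events"].insert_one(event)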

Fixing Prediction Problem

In this chapter, we will focus on fixing a prediction problem with the help of a specific scenario.

Consider a company that wants to automate its loan eligibility checks based on the customer details provided through an online application form. The details include the name of the customer, gender, marital status, loan amount and other mandatory details. The details are recorded in the CSV file as shown below −

Execute the following code to evaluate the prediction problem −

import pandas as pd
from sklearn import ensemble
import numpy as np
from scipy.stats import mode
from sklearn import preprocessing, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# loading the dataset
data = pd.read_csv("train.csv", index_col="Loan_ID")

def num_missing(x):
   return sum(x.isnull())

# imputing the missing categorical values with the mode
data["Gender"].fillna(mode(list(data["Gender"])).mode[0], inplace=True)
data["Married"].fillna(mode(list(data["Married"])).mode[0], inplace=True)
data["Self_Employed"].fillna(mode(list(data["Self_Employed"])).mode[0], inplace=True)

# imputing the mean for the missing loan amount
data["LoanAmount"].fillna(data["LoanAmount"].mean(), inplace=True)

mapping = {"0": 0, "1": 1, "2": 2, "3+": 3}
data = data.replace({"Dependents": mapping})
data["Dependents"].fillna(data["Dependents"].mean(), inplace=True)
data["Loan_Amount_Term"].fillna(method="ffill", inplace=True)
data["Credit_History"].fillna(method="ffill", inplace=True)
print(data.apply(num_missing, axis=0))

# converting the categorical data to numbers using the label encoder
var_mod = ["Gender", "Married", "Education", "Self_Employed", "Property_Area", "Loan_Status"]
le = LabelEncoder()
for i in var_mod:
   le.fit(list(data[i].values))
   data[i] = le.transform(list(data[i]))

# Train test split
x = ["Gender", "Married", "Education", "Self_Employed", "Property_Area", "LoanAmount",
   "Loan_Amount_Term", "Credit_History", "Dependents"]
y = ["Loan_Status"]
print(data[x])
X_train, X_test, y_train, y_test = model_selection.train_test_split(data[x], data[y],
   test_size=0.2)

# Random forest classifier
# clf = ensemble.RandomForestClassifier(n_estimators=100, criterion="gini",
#    max_depth=3, max_features="auto", n_jobs=-1)
clf = ensemble.RandomForestClassifier(n_estimators=200, max_features=3,
   min_samples_split=5, oob_score=True, n_jobs=-1, criterion="entropy")
clf.fit(X_train, y_train.values.ravel())
accuracy = clf.score(X_test, y_test)
print(accuracy)

Output

The above code generates the following output.
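Beyond the accuracy printed above, a short follow-up sketch such as the one below can show which attributes drive the eligibility prediction. It simply reuses the clf, x and X_test names defined in the script above.

# Follow-up sketch: inspect feature importances and score a few held-out applicants.
# Reuses clf, x and X_test from the script above.
import pandas as pd

importances = pd.Series(clf.feature_importances_, index=x).sort_values(ascending=False)
print("Feature importances:")
print(importances)

# The out-of-bag estimate gives a quick generalization check alongside the test score
print("OOB score:", clf.oob_score_)

# Predict eligibility for the first few held-out applicants
print(clf.predict(X_test.head()))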

Data Enrichment

Data enrichment refers to a range of processes used to enhance, refine and improve raw data. It refers to useful data transformation (raw data to useful information). The process of data enrichment focusses on making data a valuable asset for a modern business or enterprise.

The most common data enrichment process includes the correction of spelling mistakes or typographical errors in a database through the use of specific decision algorithms. Data enrichment tools add useful information to simple data tables.

Consider the following code for spell correction of words −

import re
from collections import Counter

def words(text):
   return re.findall(r"\w+", text.lower())

WORDS = Counter(words(open("big.txt").read()))

def P(word, N=sum(WORDS.values())):
   "Probability of `word`."
   return WORDS[word] / N

def correction(word):
   "Most probable spelling correction for `word`."
   return max(candidates(word), key=P)

def candidates(word):
   "Generate possible spelling corrections for `word`."
   return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
   "The subset of `words` that appear in the dictionary of WORDS."
   return set(w for w in words if w in WORDS)

def edits1(word):
   "All edits that are one edit away from `word`."
   letters = "abcdefghijklmnopqrstuvwxyz"
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [L + R[1:] for L, R in splits if R]
   transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
   replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
   inserts = [L + c + R for L, R in splits for c in letters]
   return set(deletes + transposes + replaces + inserts)

def edits2(word):
   "All edits that are two edits away from `word`."
   return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction("speling"))
print(correction("korrectud"))

In this program, candidate corrections are matched against "big.txt", a large text file of correctly spelled words. Candidates that appear in this file are treated as known words, and the most probable one is printed as the correction.

Output

The above code will generate the following output −
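Since data enrichment usually targets tables rather than single words, the following sketch applies the correction() function defined above to a text column of a small pandas DataFrame; the column name and sample rows are illustrative.

# Enrichment sketch: add a corrected-text column to a small table.
# Assumes correction() from the program above is already defined;
# the DataFrame contents are illustrative.
import pandas as pd

df = pd.DataFrame({"comment": ["speling is hard", "korrectud entry"]})

def correct_text(text):
   # correct each word independently and rejoin the sentence
   return " ".join(correction(w) for w in text.split())

df["comment_corrected"] = df["comment"].apply(correct_text)
print(df)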

Agile Data Science – Home

Agile is a software development methodology that helps in building software through incremental sessions using short iterations of 1 to 4 weeks, so that the development stays aligned with changing business needs. Agile Data Science combines the agile methodology with data science. In this tutorial, we have used appropriate examples to help you understand agile development and data science in a general and quick way.

Audience

This tutorial has been prepared for developers and project managers to help them understand the basics of agile principles and their implementation. After completing this tutorial, you will find yourself at a moderate level of expertise, from where you can advance further with the implementation of data science and agile methodology.

Prerequisites

It is important to have basic knowledge of data science modules and software development concepts such as software requirements, coding and testing.

Improving Prediction Performance

In this chapter, we will focus on building a model that helps in predicting students' performance, with a number of attributes included in it. The focus is to display the failure result of students in an examination.

Process

The target value of the assessment is G3. This value can be binned and further classified as failure or success. If the G3 value is greater than or equal to 10, the student passes the examination.

Example

Consider the following example wherein a code is executed to predict the performance of students −

import numpy as np
import pandas as pd

""" Read data file as DataFrame """
df = pd.read_csv("student-mat.csv", sep=";")

""" Import ML helpers """
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC   # Support Vector Machine Classifier model

""" Split Data into Training and Testing Sets """
def split_data(X, Y):
   return train_test_split(X, Y, test_size=0.2, random_state=17)

""" Confusion Matrix """
def confuse(y_true, y_pred):
   cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
   # print("\nConfusion Matrix:\n", cm)
   fpr(cm)
   ffr(cm)

""" False Pass Rate """
def fpr(confusion_matrix):
   fp = confusion_matrix[0][1]
   tf = confusion_matrix[0][0]
   rate = float(fp) / (fp + tf)
   print("False Pass Rate: ", rate)

""" False Fail Rate """
def ffr(confusion_matrix):
   ff = confusion_matrix[1][0]
   tp = confusion_matrix[1][1]
   rate = float(ff) / (ff + tp)
   print("False Fail Rate: ", rate)
   return rate

""" Train Model and Print Score """
def train_and_score(X, y):
   X_train, X_test, y_train, y_test = split_data(X, y)
   clf = Pipeline([
      ("reduce_dim", SelectKBest(chi2, k=2)),
      ("train", LinearSVC(C=100))
   ])
   scores = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=2)
   print("Mean Model Accuracy:", np.array(scores).mean())
   clf.fit(X_train, y_train)
   confuse(y_test, clf.predict(X_test))
   print()

""" Main Program """
def main():
   print("\nStudent Performance Prediction")

   # For each feature, encode to categorical values
   class_le = LabelEncoder()
   for column in df[["school", "sex", "address", "famsize", "Pstatus", "Mjob",
      "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid", "activities",
      "nursery", "higher", "internet", "romantic"]].columns:
      df[column] = class_le.fit_transform(df[column].values)

   # Encode G1, G2, G3 as pass or fail binary values
   for i, row in df.iterrows():
      if row["G1"] >= 10:
         df.at[i, "G1"] = 1
      else:
         df.at[i, "G1"] = 0

      if row["G2"] >= 10:
         df.at[i, "G2"] = 1
      else:
         df.at[i, "G2"] = 0

      if row["G3"] >= 10:
         df.at[i, "G3"] = 1
      else:
         df.at[i, "G3"] = 0

   # Target values are G3
   y = df.pop("G3")

   # Feature set is remaining features
   X = df

   print("\n\nModel Accuracy Knowing G1 & G2 Scores")
   print("=====================================")
   train_and_score(X, y)

   # Remove grade report 2
   X.drop(["G2"], axis=1, inplace=True)
   print("\n\nModel Accuracy Knowing Only G1 Score")
   print("=====================================")
   train_and_score(X, y)

   # Remove grade report 1
   X.drop(["G1"], axis=1, inplace=True)
   print("\n\nModel Accuracy Without Knowing Scores")
   print("=====================================")
   train_and_score(X, y)

main()

Output

The above code generates the output as shown below. The prediction is treated with reference to only one variable. With reference to one variable, the student performance prediction is as shown below −

Implementation of Agile

There are various methodologies used in the agile development process. These methodologies can be used for the data science research process as well. The flowchart given below shows the different methodologies −

Scrum

In software development terms, scrum means managing work with a small team and managing a specific project to reveal the strengths and weaknesses of the project.

Crystal methodologies

Crystal methodologies include innovative techniques for product management and execution. With this method, teams can go about similar tasks in different ways. The Crystal family is one of the easiest methodologies to apply.

Dynamic Systems Development Method

This delivery framework is primarily used to implement the current knowledge system in software methodology.

Feature Driven Development

The focus of this development life cycle is the features involved in the project. It works best for domain object modeling and code and feature development with ownership.

Lean Software Development

This method aims at increasing the speed of software development at low cost and focusses the team on delivering specific value to the customer.

Extreme Programming

Extreme programming is a unique software development methodology which focusses on improving software quality. It is effective when the customer is not sure about the functionality of the project.

Agile methodologies are taking root in the data science stream, and agile is considered an important software methodology. Using agile, self-organizing, cross-functional teams can work together in an effective manner. As mentioned, there are six main categories of agile development and each of them can be streamed with data science as per the requirements. Data science involves an iterative process for statistical insights. Agile helps in breaking down data science modules and helps in processing iterations and sprints in an effective manner.

The process of Agile Data Science is an amazing way of understanding how and why a data science module is implemented. It solves problems in a creative manner.

NoSQL & Dataflow programming

There are times when data is unavailable in relational format and we need to keep it transactional with the help of NoSQL databases.

In this chapter, we will focus on the dataflow of NoSQL. We will also learn how it operates with a combination of agile and data science.

One of the major reasons to use NoSQL with agile is to increase speed in the face of market competition. The following reasons show how NoSQL is a best fit for agile software methodology −

Fewer Barriers

Changing a model mid-stream has real costs, even in the case of agile development. With NoSQL, users work with aggregate data instead of wasting time normalizing data. The main point is to get something done and working, rather than spending effort on making a perfect data model.

Increased Scalability

Whenever an organization creates a product, it lays more focus on scalability. NoSQL is always known for its scalability, but it works better when designed for horizontal scalability.

Ability to leverage data

NoSQL uses a schema-less data model that allows the user to readily work with volumes of data that include several parameters of variability and velocity. When considering a choice of technology, you should always consider the one which leverages the data to a greater scale.

Dataflow of NoSQL

Let us consider the following example, wherein we show how a data model is focused on creating the RDBMS schema. Following are the different requirements of the schema −

User identification should be listed.
Every user should have at least one mandatory skill.
The details of every user's experience should be maintained properly.

The user table is normalized into 3 separate tables −

Users
User skills
User experience

The complexity of querying the database increases, and greater time consumption is noted with increased normalization, which is not good for the agile methodology. The same schema can be designed with a NoSQL database as mentioned below −

NoSQL maintains the structure in JSON format, which is light-weight in structure. With JSON, applications can store objects with nested data as single documents.
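To make the contrast concrete, the following sketch stores the same user, skill and experience information as a single nested document in MongoDB; the field names and sample values are illustrative assumptions.

# Sketch: the normalized Users / User skills / User experience tables
# collapsed into one nested document (field names and values are illustrative).
from pymongo import MongoClient

user_doc = {
    "user_id": "U1001",
    "name": "Asha",
    "skills": ["python", "spark", "mongodb"],          # at least one skill
    "experience": [
        {"company": "Acme Corp", "role": "Data Analyst", "years": 2},
        {"company": "Globex", "role": "Data Scientist", "years": 3},
    ],
}

client = MongoClient("mongodb://localhost:27017/")
users = client["agile_ds"]["users"]
users.insert_one(user_doc)

# A single query returns the user together with skills and experience,
# with no joins across normalized tables
print(users.find_one({"user_id": "U1001"}))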