Big Data Analytics – Statistical Methods

When analyzing data, a statistical approach is often useful. The basic tools needed to perform basic analysis are:

Correlation analysis
Analysis of variance
Hypothesis testing

When working with large datasets, these methods are not a problem, as they are not computationally intensive, with the exception of correlation analysis. In that case, it is always possible to take a sample, and the results should be robust.

Correlation Analysis

Correlation analysis seeks to find linear relationships between numeric variables. This can be of use in different circumstances. One common use is exploratory data analysis; section 16.0.2 of the book gives a basic example of this approach. The correlation metric used in that example is the Pearson coefficient. There is, however, another interesting correlation metric that is not affected by outliers: the Spearman correlation. The Spearman correlation is more robust to the presence of outliers than the Pearson method and gives better estimates of the relations between numeric variables when the data is not normally distributed.

library(ggplot2)

# Select variables that are interesting to compare the Pearson and Spearman correlation methods.
x = diamonds[, c("x", "y", "z", "price")]

# From the histograms we can expect differences in the correlations of both metrics.
# In this case, as the variables are clearly not normally distributed, the Spearman correlation
# is a better estimate of the linear relation among numeric variables.
par(mfrow = c(2,2))
colnm = names(x)
for(i in 1:4) {
   hist(x[[i]], col = "deepskyblue3", main = sprintf("Histogram of %s", colnm[i]))
}
par(mfrow = c(1,1))

From the histograms in the following figure, we can expect differences in the correlations of both metrics. In this case, as the variables are clearly not normally distributed, the Spearman correlation is a better estimate of the relation among the numeric variables. In order to compute the correlation in R, open the file bda/part2/statistical_methods/correlation/correlation.R, which has this code section.

## Correlation Matrix - Pearson and Spearman
cor_pearson <- cor(x, method = "pearson")
cor_spearman <- cor(x, method = "spearman")

### Pearson Correlation
print(cor_pearson)
#               x         y         z     price
# x     1.0000000 0.9747015 0.9707718 0.8844352
# y     0.9747015 1.0000000 0.9520057 0.8654209
# z     0.9707718 0.9520057 1.0000000 0.8612494
# price 0.8844352 0.8654209 0.8612494 1.0000000

### Spearman Correlation
print(cor_spearman)
#               x         y         z     price
# x     1.0000000 0.9978949 0.9873553 0.9631961
# y     0.9978949 1.0000000 0.9870675 0.9627188
# z     0.9873553 0.9870675 1.0000000 0.9572323
# price 0.9631961 0.9627188 0.9572323 1.0000000

Chi-squared Test

The chi-squared test allows us to test whether two random variables are independent, that is, whether the probability distribution of each variable influences the other. To evaluate the test in R, we first need to create a contingency table and then pass the table to the chisq.test R function.

For example, let's check whether there is an association between the variables cut and color from the diamonds dataset. The test is formally defined as:

H0: The variables cut and color are independent
H1: The variables cut and color are not independent

We might assume there is a relationship between these two variables from their names alone, but the test gives an objective "rule" saying how significant the result is.
In the following code snippet, we find that the p-value of the test is less than 2.2e-16, which is almost zero in practical terms. After running the test with a Monte Carlo simulation, we find that the p-value is 0.0004998, still well below the 0.05 threshold. This result means that we reject the null hypothesis (H0), so we believe the variables cut and color are not independent.

library(ggplot2)

# Use the table function to compute the contingency table
tbl = table(diamonds$cut, diamonds$color)
tbl

#              D    E    F    G    H    I    J
# Fair       163  224  312  314  303  175  119
# Good       662  933  909  871  702  522  307
# Very Good 1513 2400 2164 2299 1824 1204  678
# Premium   1603 2337 2331 2924 2360 1428  808
# Ideal     2834 3903 3826 4884 3115 2093  896

# In order to run the test we just use the chisq.test function.
chisq.test(tbl)

# Pearson's Chi-squared test
# data: tbl
# X-squared = 310.32, df = 24, p-value < 2.2e-16

# It is also possible to compute the p-value using a Monte Carlo simulation.
# We need to add the simulate.p.value = TRUE flag and the number of simulations.
chisq.test(tbl, simulate.p.value = TRUE, B = 2000)

# Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
# data: tbl
# X-squared = 310.32, df = NA, p-value = 0.0004998

T-test

The idea of the t-test is to evaluate whether the distribution of a numeric variable differs between groups of a nominal variable. To demonstrate this, we select the Fair and Ideal levels of the factor variable cut and then compare the values of a numeric variable between those two groups.

data = diamonds[diamonds$cut %in% c("Fair", "Ideal"), ]

# Drop levels that aren't used from the cut variable
data$cut = droplevels.factor(data$cut)
df1 = data[, c("cut", "price")]

# We can see the price means are different for each group
tapply(df1$price, df1$cut, mean)

# Fair     Ideal
# 4358.758 3457.542

The t-tests are implemented in R with the t.test function. The formula interface to t.test is the simplest way to use it; the idea is that a numeric variable is explained by a group variable. For example: t.test(numeric_variable ~ group_variable, data = data). In the previous example, the numeric_variable is price and the group_variable is cut.

From a statistical perspective, we are testing whether there are differences in the distributions of the numeric variable between the two groups. Formally, the hypothesis test is described with a null hypothesis (H0) and an alternative hypothesis (H1):

H0: There are no differences in the distributions of the price variable between the Fair and Ideal groups
H1: There are differences in the distributions of the price variable between the Fair and Ideal groups
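To complete the example, the test described above can be run with the formula interface. The following is a minimal sketch, assuming the ggplot2 diamonds dataset used throughout this section; the exact statistics will depend on the version of the data.

library(ggplot2)

# Keep only the Fair and Ideal cuts, as in the example above
data <- diamonds[diamonds$cut %in% c("Fair", "Ideal"), ]
data$cut <- droplevels(data$cut)

# Formula interface: the numeric variable (price) is explained by the group variable (cut)
t.test(price ~ cut, data = data)

# R performs a Welch two-sample t-test by default; a small p-value would lead us to
# reject H0 and conclude that mean price differs between the Fair and Ideal groups.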

Big Data Adoption & Planning Considerations

Adopting big data comes with its own set of challenges and considerations, but with careful planning, organizations can maximize its benefits. Big Data initiatives should be strategic and business-driven, and the adoption of big data can facilitate this change. The use of Big Data can be transformative, but it is usually innovative. Transformation activities are often low-risk and aim to improve efficiency and effectiveness.

The nature of Big Data and its analytic power bring issues and challenges that need to be planned for from the beginning. For example, the adoption of new technology raises concerns about securing it in a way that conforms to existing corporate standards, and these concerns need to be addressed. Issues related to tracking the provenance of a dataset from its procurement to its utilization are often new requirements for organizations. It is also necessary to plan for managing the privacy of constituents whose data is being processed or whose identity is revealed by analytical processes. All of these factors require that an organisation recognise and implement a set of distinct governance processes and decision frameworks to ensure that all parties involved understand the nature, consequences, and management requirements of Big Data.

The approach to performing business analysis is changing with the adoption of Big Data, and the Big Data analytics lifecycle is an effective solution. There are different factors to consider when implementing Big Data. The following image depicts the big data adoption and planning considerations.

The primary big data adoption and planning considerations are as follows:

Organization Prerequisites
Big Data frameworks are not turnkey solutions. Enterprises require data management and Big Data governance frameworks for data analysis and analytics to be useful. Effective processes are required for implementing, customising, filling, and utilising Big Data solutions.

Define Objectives
Outline your aims and objectives for implementing big data. Whether it's improving the customer experience, optimising processes, or improving decision-making, defined objectives give decision-makers a clear direction for framing strategy.

Data Procurement
The acquisition of Big Data solutions can be cost-effective, due to the availability of open-source platforms and tools, as well as the potential to leverage commodity hardware. A substantial budget may still be required to obtain external data. Most commercially relevant data will have to be purchased, which may necessitate ongoing subscription expenses to ensure the delivery of updates to the obtained datasets.

Infrastructure
Evaluate your current infrastructure to see if it can handle big data processing and analytics. Consider whether you need to invest in new hardware, software, or cloud-based solutions to manage the volume, velocity, and variety of data.

Data Strategy
Create a comprehensive data strategy that is aligned with your business objectives. This includes determining what sorts of data are required, where to obtain them, how to store and manage them, and how to ensure their quality and security.

Data Privacy and Security
Analytics on datasets may reveal confidential data about organisations or individuals. Even datasets that contain only benign data can reveal private information when they are reviewed collectively.
Addressing these privacy concerns requires an understanding of the nature of the data being collected, the relevant data privacy rules, and particular procedures for data tagging and anonymization. For example, telemetry data, such as a car's GPS records or smart meter readings, accumulated over a long period can expose an individual's location and behavior.

Security
Securing data networks and repositories with authentication and authorization mechanisms is an essential element of securing big data.

Provenance
Provenance refers to information about the data's origins and processing. Provenance information is used to determine the validity and quality of data and can also be used for auditing. It can be difficult to maintain provenance as large volumes of data are collected, integrated, and processed in different phases.

Limited Realtime Support
Dashboards and other applications that require streaming data and alerts frequently need real-time or near-realtime data transmission. Many open-source Big Data solutions and tools are batch-oriented; however, a new wave of real-time open-source technologies supports streaming data processing.

Distinct Performance Challenges
With the large amounts of data that Big Data solutions must handle, performance is frequently an issue. For example, massive datasets combined with advanced search algorithms can lead to long query times.

Distinct Governance Requirements
Big Data solutions access and generate data, and both become corporate assets. A governance framework is essential to ensure that the data and the solution environment are regulated, standardized, and evolved in a controlled way. Establish strong data governance policies to assure data quality, integrity, and privacy, and compliance with legislation such as GDPR and CCPA. Define data management roles and responsibilities, as well as processes for data access, usage, and security.

Distinct Methodology
A mechanism is necessary to govern the flow of data into and out of Big Data systems. It needs to address how to construct feedback loops so that processed data can be revised and reused.

Continuous Improvement
Big data initiatives are iterative and require ongoing development over time. Monitor performance indicators, gather feedback, and fine-tune your strategy to ensure that you are getting the most out of your data investments.

By carefully examining and planning for these factors, organisations can successfully adopt and exploit big data to drive innovation, enhance efficiency, and gain a competitive advantage in today's data-driven world.

Big Data Analytics – Data Analyst

A Data Analyst is a person who collects, analyses, and interprets data to solve a particular problem. A data analyst devotes a lot of time to examining data and presenting insights in the form of graphical reports and dashboards. Hence, a data analyst has a reporting-oriented profile and has experience in extracting and analyzing data from traditional data warehouses using SQL.

Working as a data analyst in big data analytics is a dynamic role. Big data analytics involves analysing large and varied datasets to discover hidden patterns, unknown relationships, market trends, customer needs, and related valuable business insights. In today's scenario, organizations struggle to find competent data scientists in the market, so it is often a good idea to select prospective data analysts and train them in the relevant skills to become data scientists. A competent data analyst has skills such as business understanding, SQL programming, report design, and dashboard creation.

Role and Responsibilities of Data Analyst
The image below incorporates the major roles and responsibilities of a data analyst.

Data Collection
This refers to the process of collecting data from different sources such as databases, data warehouses, APIs, and IoT devices. It could include conducting surveys, tracking visitor behaviour on a company's website, or buying relevant datasets from data collection specialists.

Data Cleaning and Pre-processing
Raw data may contain duplicates, errors, or outliers. Cleaning raw data eliminates errors, inconsistencies, and duplicates, while pre-processing converts data into an analytically useful format. Cleaning data entails maintaining data quality in a spreadsheet or using a programming language so that interpretations are correct and unbiased.

Exploratory Data Analysis (EDA)
Using statistical methods and visualization tools, data is analysed to identify trends, patterns, and relationships.

Data Modeling
This includes creating and designing database structures, deciding what types of data will be stored and collected, and defining how data categories are related and how the data appears.

Statistical Analysis
Applying statistical techniques to interpret data, validate hypotheses, and make predictions.

Machine Learning
Building predictive models using machine learning algorithms to predict future trends, classify data, or detect anomalies.

Data Visualization
Creating visual representations such as charts, graphs, and dashboards to communicate data insights effectively to stakeholders.

Data Interpretation and Reporting
Communicating findings and recommendations to decision-makers through the interpretation of analysis results and the preparation of reports or presentations.

Continuous Learning
Keeping up to date with the latest developments in data analysis, big data technologies, and business trends.

A data analyst builds a foundation of proficiency in statistics, programming languages like Python or R, database fundamentals, SQL, and big data technologies such as Hadoop, Spark, and NoSQL databases.

What Tools Does a Data Analyst Use?
A data analyst often uses the following tools to process assigned work more accurately and efficiently during data analysis.
Some common tools used by data analysts are shown in the image below.

Types of Data Analysts
As technology has advanced rapidly, so have the types and amounts of data that can be collected and classified, and analysing data has become an essential skill in almost every business. In the current scenario, every domain has data analyst experts, such as data analysts in criminal justice, fashion, food, technology, business, the environment, and the public sector, among many others. People who perform data analysis might be known as:

Medical and health care analyst
Market research analyst
Business analyst
Business intelligence analyst
Operations research analyst

Data Analyst Skills
Generally, the skills of data analysts are divided into two major groups, i.e. technical skills and behavioural skills.

Data Analyst Technical Skills
Data Cleaning − A data analyst is proficient in identifying and handling missing data, outliers, and errors in datasets.
Database Tools − Microsoft Excel and SQL are essential tools for any data analyst. Excel is the most widely used tool in industry, while SQL is capable of handling larger datasets, using SQL queries to manipulate and manage data as per the user's needs.
Programming Languages − Data analysts are proficient in languages such as Python, R, SQL, or others used for data manipulation, analysis, and visualization. Learning Python or R makes an analyst proficient in working with large datasets and complex computations; both are popular for data analysis.
Data Visualisation − A competent data analyst must present findings clearly and compellingly. Knowing how to show data in charts and graphs helps coworkers, employers, and stakeholders comprehend the work. Some popular data visualization tools are Tableau, Jupyter Notebook, and Excel.
Data Storytelling − Data analysts can find and communicate insights effectively through storytelling, using data visualization and narrative techniques.
Statistics and Maths − Statistical methods and tools are used to analyse data distributions, correlations, and trends. Knowledge of statistics and maths helps determine which tools are best suited to a particular problem, identify errors in data, and better understand the results.
Big Data Tools − Data analysts are familiar with big data processing tools and frameworks like Hadoop, Spark, or Apache Kafka.
Data Warehousing − Data analysts also understand data warehousing concepts and work with tools such as Amazon Redshift, Google BigQuery, or Snowflake.
Data Governance and Compliance − Data analysts are aware of data governance principles, data privacy laws, and regulations (such as GDPR and HIPAA).
APIs and Web Scraping − Data analysts have expertise in pulling data from web APIs and scraping data from websites using libraries like requests (Python) or BeautifulSoup.

Behavioural Skills
Problem-solving − A data analyst can understand the problem that needs to be solved and break it down into questions the data can answer.

Big Data Analytics – Characteristics

Big Data refers to extremely large data sets that may be analyzed to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.

Big Data Characteristics
The characteristics of Big Data, often summarized as the "Five V's", include the following.

Volume
As its name implies, volume refers to the large amount of data generated and stored every second by IoT devices, social media, videos, financial transactions, and customer logs. The data generated from these devices and sources can range from terabytes to petabytes and beyond. Managing such large quantities of data requires robust storage solutions and advanced data processing techniques; the Hadoop framework is used to store, access, and process big data. Facebook, for example, generates about 4 petabytes of data per day (a petabyte is a million gigabytes), and all that data is stored in what is known as the Hive, which contains about 300 petabytes of data [1].

Fig: Minutes spent per day on social apps (Image source: Recode)
Fig: Engagement per user on leading social media apps in India (Image source: www.statista.com) [2]

From the above graphs, we can see how users devote their time to different channels and generate data; hence, the data volume is growing day by day.

Velocity
Velocity is the speed with which data is generated, processed, and analysed. With the development and use of IoT devices and real-time data streams, the velocity of data has increased tremendously, demanding systems that can process data instantly to derive meaningful insights.

Variety
Big Data includes different types of data: structured data (found in databases), unstructured data (such as text, images, and videos), and semi-structured data (such as JSON and XML). This diversity requires advanced tools for data integration, storage, and analysis.

Veracity
Veracity refers to the accuracy and trustworthiness of the data. Ensuring data quality, addressing data discrepancies, and dealing with data ambiguity are all major issues in Big Data analytics.

Value
Value is the ability to convert large volumes of data into useful insights. Big Data's ultimate goal is to extract meaningful and actionable insights that can lead to better decision-making, new products, enhanced consumer experiences, and competitive advantages.

These qualities characterise the nature of Big Data and highlight the importance of modern tools and technologies for effective data management, processing, and analysis.

Cassandra – Create Data

Creating Data in a Table
You can insert data into the columns of a row in a table using the command INSERT. Given below is the syntax for creating data in a table.

INSERT INTO <tablename>
(<column1 name>, <column2 name>....)
VALUES (<value1>, <value2>....)
USING <option>

Example
Let us assume there is a table called emp with columns (emp_id, emp_name, emp_city, emp_phone, emp_sal) and you have to insert the following data into the emp table.

emp_id   emp_name   emp_city    emp_phone    emp_sal
1        ram        Hyderabad   9848022338   50000
2        robin      Hyderabad   9848022339   40000
3        rahman     Chennai     9848022330   45000

Use the commands given below to fill the table with the required data.

cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);

cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);

cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);

Verification
After inserting data, use the SELECT statement to verify whether the data has been inserted or not. If you verify the emp table using the SELECT statement, it will give you the following output.

cqlsh:tutorialspoint> SELECT * FROM emp;

 emp_id | emp_city  | emp_name | emp_phone  | emp_sal
--------+-----------+----------+------------+---------
      1 | Hyderabad |      ram | 9848022338 |   50000
      2 | Hyderabad |    robin | 9848022339 |   40000
      3 |   Chennai |   rahman | 9848022330 |   45000

(3 rows)

Here you can observe that the table has been populated with the data we inserted.

Creating Data using Java API
You can create data in a table using the execute() method of the Session class. Follow the steps given below to create data in a table using the Java API.

Step 1: Create a Cluster Object
Create an instance of the Cluster.Builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, the Cluster.Builder class has a method called build(). The following code shows how to create a cluster object.

//Building a cluster
Cluster cluster = builder2.build();

You can also build a cluster object using a single line of code as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object
Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, you can set the session to it by passing the keyspace name in string format to this method as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace called tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");

Step 3: Execute Query
You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method. Whatever you pass to this method in string format will be executed on the cqlsh.
In the following example, we are inserting data into a table called emp. You have to store the query in a string variable and pass it to the execute() method as shown below.

String query1 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);";
String query2 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);";
String query3 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);";

session.execute(query1);
session.execute(query2);
session.execute(query3);

Given below is the complete program to insert data into a table in Cassandra using the Java API.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Create_Data {

   public static void main(String args[]) {

      //queries
      String query1 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);";
      String query2 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);";
      String query3 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the queries
      session.execute(query1);
      session.execute(query2);
      session.execute(query3);

      System.out.println("Data created");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$ javac Create_Data.java
$ java Create_Data

Under normal conditions, it should produce the following output:

Data created
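As mentioned in Step 3, the execute() method also accepts Statement objects instead of plain strings. The following is a minimal sketch of that variant, assuming the same tp keyspace, the emp table used above, and the com.datastax.driver.core driver; the inserted rows are illustrative, and the Java types passed to bind() must match your actual column types.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class Create_Data_Statement {

   public static void main(String args[]) {

      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("tp");

      // Option 1: a SimpleStatement simply wraps a plain CQL string.
      session.execute(new SimpleStatement(
         "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (4, 'raj', 'Pune', 9848022341, 42000);"));

      // Option 2: a PreparedStatement is parsed once and reused with bound values.
      // The bound Java types must match the CQL column types of your emp table
      // (for example, Long for bigint or java.math.BigInteger for varint columns);
      // here emp_phone and emp_sal are assumed to be bigint and int respectively.
      PreparedStatement prepared = session.prepare(
         "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (?, ?, ?, ?, ?)");
      BoundStatement bound = prepared.bind(5, "priya", "Delhi", 9848022342L, 38000);
      session.execute(bound);

      System.out.println("Data created using Statement objects");
      cluster.close();
   }
}

Prepared statements are generally preferable when the same insert is executed many times, since the query is parsed by the cluster only once.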

AWS Quicksight – Creating Story

Story is an option wherein you capture a series of screens and play them one by one. For example, if you want to see a visual with different filter options, you can use a story.

To create a story, click Story in the leftmost panel. By default, there is a story named Storyboard 1. Now capture the screen using the capture icon in the top right panel. Each capture of the screen is also referred to as a scene. You can capture multiple scenes, and they will be added under "Storyboard 1". The data in the story is automatically refreshed once your main data source is refreshed.

Apache Solr – Home

Apache Solr Tutorial
Solr is a scalable, ready-to-deploy search/storage engine optimized to search large volumes of text-centric data. Solr is enterprise-ready, fast, and highly scalable. In this tutorial, we are going to learn the basics of Solr and how you can use it in practice.

Audience
This tutorial will be helpful for all those developers who would like to understand the basic functionalities of Apache Solr in order to develop sophisticated and high-performing applications.

Prerequisites
Before proceeding with this tutorial, we expect that the reader has good Java programming skills (although it is not mandatory) and some prior exposure to the Lucene and Hadoop environments.

Apache Solr – Overview

Solr is an open-source search platform which is used to build search applications. It was built on top of Lucene (a full-text search engine). Solr is enterprise-ready, fast, and highly scalable. The applications built using Solr are sophisticated and deliver high performance.

It was Yonik Seeley who created Solr in 2004 in order to add search capabilities to the company website of CNET Networks. In January 2006, it was made an open-source project under the Apache Software Foundation. Its latest version, Solr 6.0, was released in 2016 with support for the execution of parallel SQL queries.

Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. Not only search, Solr can also be used for storage purposes. Like other NoSQL databases, it is a non-relational data storage and processing technology. In short, Solr is a scalable, ready-to-deploy search/storage engine optimized to search large volumes of text-centric data.

Features of Apache Solr
Solr is a wrapper around Lucene's Java API. Therefore, using Solr, you can leverage all the features of Lucene. Let us take a look at some of the most prominent features of Solr:

Restful APIs − To communicate with Solr, it is not mandatory to have Java programming skills. Instead, you can use RESTful services to communicate with it. You enter documents in Solr in file formats like XML, JSON, and CSV and get results in the same formats (see the example at the end of this section).
Full text search − Solr provides all the capabilities needed for a full-text search, such as tokens, phrases, spell check, wildcard, and auto-complete.
Enterprise ready − According to the needs of the organization, Solr can be deployed in any kind of system (big or small), such as standalone, distributed, cloud, etc.
Flexible and Extensible − By extending the Java classes and configuring them accordingly, we can customize the components of Solr easily.
NoSQL database − Solr can also be used as a big-data-scale NoSQL database where we can distribute the search tasks along a cluster.
Admin Interface − Solr provides an easy-to-use, user-friendly, feature-rich user interface, using which we can perform all the possible tasks such as managing logs and adding, deleting, updating, and searching documents.
Highly Scalable − While using Solr with Hadoop, we can scale its capacity by adding replicas.
Text-Centric and Sorted by Relevance − Solr is mostly used to search text documents, and the results are delivered in order of relevance to the user's query.

Unlike Lucene, you don't need to have Java programming skills while working with Apache Solr. It provides a wonderful ready-to-deploy service to build a search box featuring autocomplete, which Lucene doesn't provide. Using Solr, we can scale, distribute, and manage indexes for large-scale (Big Data) applications.

Lucene in Search Applications
Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability. Lucene is a scalable and high-performance library used to index and search virtually any kind of text. The Lucene library provides the core operations required by any search application, such as indexing and searching.

If we have a web portal with a huge volume of data, then we will most probably require a search engine in our portal to extract relevant information from the huge pool of data. Lucene works as the heart of any search application and provides the vital operations pertaining to indexing and searching.
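As a quick illustration of the RESTful interface mentioned above, documents can be indexed and queried over plain HTTP. The following is a minimal sketch, assuming a Solr instance running on the default port with a core named my_core; the core name and field names are placeholders, not part of this tutorial's setup.

# Index a JSON document through the update handler and commit it
curl -X POST -H 'Content-Type: application/json' \
   'http://localhost:8983/solr/my_core/update?commit=true' \
   --data-binary '[{"id": "doc1", "title": "Hello Solr"}]'

# Query the core for documents whose title field matches "Solr"
curl 'http://localhost:8983/solr/my_core/select?q=title:Solr'

The same operations can be performed with XML or CSV payloads by changing the Content-Type and request body accordingly.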

Apache Solr – On Hadoop

Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. In this section, let us understand how you can install Hadoop on your system.

Downloading Hadoop
Given below are the steps to be followed to download Hadoop onto your system.

Step 1 − Go to the homepage of Hadoop. You can use the link www.hadoop.apache.org/. Click the link Releases, as highlighted in the following screenshot. It will redirect you to the Apache Hadoop Releases page, which contains links for mirrors of source and binary files of various versions of Hadoop.

Step 2 − Select the latest version of Hadoop (in our tutorial, it is 2.6.4) and click its binary link. It will take you to a page where mirrors for the Hadoop binary are available. Click one of these mirrors to download Hadoop.

Download Hadoop from Command Prompt
Open a Linux terminal and log in as super-user.

$ su
password:

Go to the directory where you need to install Hadoop, and save the file there using the link copied earlier, as shown in the following code block.

# cd /usr/local
# wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

After downloading Hadoop, extract it using the following commands.

# tar zxvf hadoop-2.6.4.tar.gz
# mkdir hadoop
# mv hadoop-2.6.4/* hadoop/
# exit

Installing Hadoop
Follow the steps given below to install Hadoop in pseudo-distributed mode.

Step 1: Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Next, apply all the changes to the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop:

core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml

core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the Read/Write buffers. Open core-site.xml and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the value of the replication data, the namenode path, and the datanode paths of your local file systems, i.e. the place where you want to store the Hadoop infrastructure. Let us assume the following data.

dfs.replication (data replication value) = 1

(In the path given below, /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)

datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, you need to copy the file mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.

$ cd ~
$ hdfs namenode -format

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.4
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1
images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2: Verifying the Hadoop dfs
The following command is used to start the Hadoop dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying the Yarn Script
The following command is used to start the Yarn script. Executing this command will start your Yarn daemons.
$ start-yarn.sh

The expected output is as follows:

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services in your browser.

http://localhost:50070/

Installing Solr on Hadoop
Follow the steps given below to download and install Solr.

Step 1
Open the homepage of Apache Solr by clicking the following link − https://lucene.apache.org/solr/

Step 2
Click the download button (highlighted in the above screenshot). On clicking, you will be redirected to a page with various mirrors of Apache Solr. Select a mirror and click on it, which will redirect you to a page where you can download the source and binary files of Apache Solr.
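Once the archive has been downloaded from one of the mirrors, the remaining steps typically look like the following. This is a minimal sketch, assuming a Solr 6.x binary archive and the default port; the file name will vary with the version you pick.

# Extract the downloaded archive and move into it
$ tar zxvf solr-6.2.0.tgz
$ cd solr-6.2.0

# Start Solr on the default port 8983
$ bin/solr start

# Verify that Solr is running
$ bin/solr status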

Apache Solr – Core

A Solr core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. We need to create a Solr core to perform operations like indexing and analyzing. A Solr application may contain one or multiple cores. If necessary, two cores in a Solr application can communicate with each other.

Creating a Core
After installing and starting Solr, you can connect to the client (web interface) of Solr. As highlighted in the following screenshot, initially there are no cores in Apache Solr. Now, we will see how to create a core in Solr.

Using the create command
One way to create a core is to create a schema-less core using the create command, as shown below:

[Hadoop@localhost bin]$ ./solr create -c Solr_sample

Here, we are trying to create a core named Solr_sample in Apache Solr. This command creates a core, displaying the following message.

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/Solr/Solr_sample

Creating new core 'Solr_sample' using command:
http://localhost:8983/Solr/admin/cores?action=CREATE&name=Solr_sample&instanceDir=Solr_sample

{
   "responseHeader":{
      "status":0,
      "QTime":11550
   },
   "core":"Solr_sample"
}

You can create multiple cores in Solr. On the left-hand side of the Solr Admin, you can see a core selector where you can select the newly created core, as shown in the following screenshot.

Using the create_core command
Alternatively, you can create a core using the create_core command. This command has the following options:

-c core_name   Name of the core you want to create
-p port_name   Port at which you want to create the core
-d conf_dir    Configuration directory to use for the core

Let's see how you can use the create_core command. Here, we will try to create a core named my_core.

[Hadoop@localhost bin]$ ./solr create_core -c my_core

On executing, the above command creates a core, displaying the following message:

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/Solr/my_core

Creating new core 'my_core' using command:
http://localhost:8983/Solr/admin/cores?action=CREATE&name=my_core&instanceDir=my_core

{
   "responseHeader":{
      "status":0,
      "QTime":1346
   },
   "core":"my_core"
}

Deleting a Core
You can delete a core using the delete command of Apache Solr. Let's suppose we have a core named my_core in Solr, as shown in the following screenshot. You can delete this core by passing its name to the delete command as follows:

[Hadoop@localhost bin]$ ./solr delete -c my_core

On executing the above command, the specified core will be deleted, displaying the following message.

Deleting core 'my_core' using command:
http://localhost:8983/Solr/admin/cores?action=UNLOAD&core=my_core&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true

{
   "responseHeader":{
      "status":0,
      "QTime":170
   }
}

You can open the web interface of Solr to verify whether the core has been deleted or not.
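Besides the web interface, the state of the cores can also be checked from the command line through the CoreAdmin API. The following is a minimal sketch, assuming Solr is running on the default port used in the examples above.

# List all cores and their status; a deleted core no longer appears in the response
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"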