Big Data Adoption and Planning Considerations

Adopting Big Data comes with its own set of challenges and considerations, but with careful planning, organizations can maximize its benefits. Big Data initiatives should be strategic and business-driven, and the adoption of Big Data can facilitate this. The use of Big Data can be transformative, though it is more often innovative; transformation activities are typically lower-risk and aim to improve efficiency and effectiveness.

The nature of Big Data and its analytic power bring issues and challenges that need to be planned for from the start. For example, the adoption of new technology raises security concerns, and the solution must be made to conform to existing corporate standards. Issues related to tracking the provenance of a dataset from its procurement to its utilization are often new requirements for organizations. It is also necessary to plan for managing the privacy of constituents whose data is being processed or whose identity is revealed by analytical processes. All of these factors require that an organisation recognise and implement a set of distinct governance processes and decision frameworks to ensure that all parties involved understand the nature, consequences, and management requirements of Big Data.

The approach to performing business analysis is changing with the adoption of Big Data, and the Big Data analytics lifecycle is an effective way to manage that change. There are several factors to consider when implementing Big Data. The following image depicts the primary Big Data adoption and planning considerations −

The primary Big Data adoption and planning considerations are as follows −

Organization Prerequisites
Big Data frameworks are not turnkey solutions. Enterprises require data management and Big Data governance frameworks for data analysis and analytics to be useful. Effective processes are required for implementing, customising, populating, and utilising Big Data solutions.

Define Objectives
Outline your aims and objectives for implementing Big Data. Whether it's improving the customer experience, optimising processes, or improving decision-making, clearly defined objectives give decision-makers a positive direction when framing strategy.

Data Procurement
The acquisition of Big Data solutions can be cost-effective, due to the availability of open-source platforms and tools and the potential to leverage commodity hardware. A substantial budget may still be required to obtain external data: most commercially relevant data has to be purchased, which may entail ongoing subscription expenses to ensure the delivery of updates to the acquired datasets.

Infrastructure
Evaluate your current infrastructure to see if it can handle Big Data processing and analytics. Consider whether you need to invest in new hardware, software, or cloud-based solutions to manage the volume, velocity, and variety of data.

Data Strategy
Create a comprehensive data strategy that is aligned with your business objectives. This includes determining what sorts of data are required, where to obtain them, how to store and manage them, and how to ensure their quality and security.

Data Privacy and Security
Analytics on datasets may reveal confidential information about organisations or individuals. Even datasets that are individually benign can reveal private information when they are analysed collectively.
Addressing these privacy concerns requires an awareness of the nature of the data being collected, the relevant data privacy regulations, and particular procedures for data tagging and anonymization (a minimal pseudonymization sketch is given at the end of this section). Telemetry data, such as a car's GPS records or smart meter readings, accumulated over a long period can expose an individual's location and behavior.

Security
Securing data networks and repositories using authentication and authorization mechanisms is an essential element of securing Big Data.

Provenance
Provenance refers to information about the data's origins and processing. Provenance information is used to determine the validity and quality of data and can also be used for auditing. It can be difficult to maintain provenance as large volumes of data are collected, integrated, and processed through different stages.

Limited Realtime Support
Dashboards and other applications that consume streaming data and alerts frequently require real-time or near-realtime data transmission. Many open-source Big Data solutions and tools are batch-oriented; however, a newer generation of open-source technologies supports streaming data processing in real time.

Distinct Performance Challenges
With the large amounts of data that Big Data solutions must handle, performance is frequently an issue. For example, massive datasets combined with complex search algorithms can lead to long query times.

Distinct Governance Requirements
Big Data solutions access and generate data, and both become corporate assets. A governance structure is essential to ensure that the data and the solution environment are regulated, standardized, and evolved in a controlled way. Establish strong data governance policies to assure data quality, integrity, privacy, and compliance with legislation such as GDPR and CCPA. Define data management roles and responsibilities, as well as processes for data access, usage, and security.

Distinct Methodology
A methodology will be necessary to govern the flow of data into and out of Big Data systems. It will also need to establish feedback loops so that processed data can be fed back and refined.

Continuous Improvement
Big Data initiatives are iterative and require ongoing development over time. Monitor performance indicators, gather feedback, and fine-tune your strategy to ensure that you're getting the most out of your data investments.

By carefully examining and planning for these factors, organisations can successfully adopt and exploit Big Data to drive innovation, enhance efficiency, and gain a competitive advantage in today's data-driven world.
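The data tagging and anonymization procedures mentioned above can be as simple as replacing direct identifiers with irreversible pseudonyms before datasets reach analysts. The following is a minimal sketch only, not a production-grade anonymization scheme; the class name, the salt value, and the choice of a salted SHA-256 hash are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Pseudonymizer {

   // Secret salt kept outside the analytics environment; illustrative value only.
   private static final String SALT = "replace-with-a-secret-salt";

   // Replaces a direct identifier (e.g. a customer ID) with a salted SHA-256 pseudonym,
   // so records can still be joined on the pseudonym without exposing the raw identifier.
   public static String pseudonymize(String identifier) {
      try {
         MessageDigest digest = MessageDigest.getInstance("SHA-256");
         byte[] hash = digest.digest((SALT + identifier).getBytes(StandardCharsets.UTF_8));
         StringBuilder hex = new StringBuilder();
         for (byte b : hash) {
            hex.append(String.format("%02x", b));
         }
         return hex.toString();
      } catch (NoSuchAlgorithmException e) {
         throw new IllegalStateException("SHA-256 not available", e);
      }
   }

   public static void main(String[] args) {
      // A hypothetical telemetry record: the GPS trace stays, the identity does not.
      String customerId = "CUST-1001";
      System.out.println("Pseudonym: " + pseudonymize(customerId));
   }
}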
Big Data Analytics – Data Analyst

A Data Analyst is a person who collects, analyses, and interprets data to solve a particular problem. A data analyst devotes a lot of time to examining data and presents insights in the form of graphical reports and dashboards. Hence, a data analyst has a reporting-oriented profile and has experience in extracting and analyzing data from traditional data warehouses using SQL.

Working as a data analyst in big data analytics is a dynamic role. Big data analytics involves analysing large and varied datasets to discover hidden patterns, unknown relationships, market trends, customer needs, and related valuable business insights. In today's scenario, organizations struggle to find competent data scientists in the market. It is therefore a good idea to select prospective data analysts and train them in the relevant skills to become data scientists. A competent data analyst has skills such as business understanding, SQL programming, report design, and dashboard creation.

Role and Responsibilities of Data Analyst
The image below summarizes the major roles and responsibilities of a data analyst −

Data Collection
This refers to the process of collecting data from different sources such as databases, data warehouses, APIs, and IoT devices. It could include conducting surveys, tracking visitor behaviour on a company's website, or buying relevant data sets from data collection specialists.

Data Cleaning and Pre-processing
Raw data may contain duplicates, errors, or outliers. Cleaning raw data eliminates errors, inconsistencies, and duplicates; pre-processing converts the data into an analytically useful format. Cleaning data entails maintaining data quality in a spreadsheet or through a programming language so that your interpretations are correct and unbiased.

Exploratory Data Analysis (EDA)
Using statistical methods and visualization tools, data is analysed to identify trends, patterns, and relationships.

Model Data
This involves creating and designing database structures, selecting the types of data to be collected and stored, and defining how data categories relate to one another and how the data is represented.

Statistical Analysis
Applying statistical techniques to interpret data, validate hypotheses, and make predictions.

Machine Learning
Building predictive models using machine learning algorithms to predict future trends, classify data, or detect anomalies.

Data Visualization
Creating visual representations such as charts, graphs, and dashboards to communicate data insights effectively to stakeholders.

Data Interpretation and Reporting
Interpreting analysis results and preparing reports or presentations to communicate findings and recommendations to decision-makers.

Continuous Learning
Keeping up to date with the latest developments in data analysis, big data technologies, and business trends.

A data analyst builds their proficiency on a foundation of statistics, programming languages like Python or R, database fundamentals, SQL, and big data technologies such as Hadoop, Spark, and NoSQL databases.

What Tools Does a Data Analyst Use?
A data analyst often uses the following tools to process assigned work more accurately and efficiently during data analysis.
Some common tools used by data analysts are shown in the image below −

Types of Data Analysts
As technology has advanced rapidly, so have the types and amounts of data that can be collected, and classifying and analysing data has become an essential skill in almost every business. Today, nearly every domain employs data analysts, including criminal justice, fashion, food, technology, business, environment, and the public sector, amongst many others. People who perform data analysis might be known as −

Medical and health care analyst
Market research analyst
Business analyst
Business intelligence analyst
Operations research analyst

Data Analyst Skills
Generally, the skills of data analysts are divided into two major groups, i.e., technical skills and behavioural skills.

Data Analyst Technical Skills
Data Cleaning − A data analyst is proficient in identifying and handling missing data, outliers, and errors in datasets.
Database Tools − Microsoft Excel and SQL are essential tools for any data analyst. Excel is the most widely used tool in industry, while SQL can handle larger datasets, using queries to manipulate and manage data according to the user's needs.
Programming Languages − Data analysts are proficient in languages such as Python, R, or SQL for data manipulation, analysis, and visualization. Learning Python or R makes an analyst proficient at working with large datasets and complex computations; both are popular choices for data analysis.
Data Visualisation − A competent data analyst must present their findings clearly and compellingly. Knowing how to show data in charts and graphs helps coworkers, employers, and stakeholders comprehend your work. Some popular data visualization tools are Tableau, Jupyter Notebook, and Excel.
Data Storytelling − Data analysts can find and communicate insights effectively through storytelling, using data visualization and narrative techniques.
Statistics and Maths − Statistical methods and tools are used to analyse data distributions, correlations, and trends. Knowledge of statistics and maths helps determine which tools are best for solving a particular problem, identify errors in data, and better understand the results.
Big Data Tools − Data analysts are familiar with big data processing tools and frameworks such as Hadoop, Spark, or Apache Kafka.
Data Warehousing − Data analysts also understand data warehousing concepts and work with tools such as Amazon Redshift, Google BigQuery, or Snowflake.
Data Governance and Compliance − Data analysts are aware of data governance principles, data privacy laws, and regulations (such as GDPR and HIPAA).
APIs and Web Scraping − Data analysts have expertise in pulling data from web APIs and scraping websites using libraries such as requests (Python) or BeautifulSoup.

Behavioural Skills
Problem-solving − A data analyst can understand the problem that
Big Data Analytics – Characteristics

Big Data refers to extremely large data sets that may be analyzed to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.

Big Data Characteristics
The characteristics of Big Data, often summarized as the "Five V's", include −

Volume
As its name implies, volume refers to the large amount of data generated and stored every second by IoT devices, social media, videos, financial transactions, and customer logs. The data generated from these devices and sources can range from terabytes to petabytes and beyond. Managing such large quantities of data requires robust storage solutions and advanced data processing techniques; the Hadoop framework, for example, is used to store, access, and process big data. Facebook generates about 4 petabytes of data per day (a petabyte is a million gigabytes). All of that data is stored in what is known as the Hive, which contains about 300 petabytes of data [1].

Fig: Minutes spent per day on social apps (Image source: Recode)
Fig: Engagement per user on leading social media apps in India (Image source: www.statista.com) [2]

From the above graphs, we can see how much time users devote to different channels; as engagement grows, data volume becomes higher day by day.

Velocity
Velocity is the speed with which data is generated, processed, and analysed. With the growth of IoT devices and real-time data streams, the velocity of data has expanded tremendously, demanding systems that can process data instantly to derive meaningful insights. Some high-velocity data applications are as follows −

Variety
Big Data includes different types of data: structured data (found in databases), unstructured data (such as text, images, and videos), and semi-structured data (such as JSON and XML). This diversity requires advanced tools for data integration, storage, and analysis.

Challenges of Managing Variety in Big Data −
Variety in Big Data Applications −

Veracity
Veracity refers to the accuracy and trustworthiness of the data. Ensuring data quality, addressing data discrepancies, and dealing with data ambiguity are all major issues in Big Data analytics.

Value
Value is the ability to convert large volumes of data into useful insights. Big Data's ultimate goal is to extract meaningful and actionable insights that lead to better decision-making, new products, enhanced consumer experiences, and competitive advantages.

These qualities characterise the nature of Big Data and highlight the importance of modern tools and technologies for effective data management, processing, and analysis.
Cassandra – Create Data
Creating Data in a Table

You can insert data into the columns of a row in a table using the command INSERT. Given below is the syntax for creating data in a table.

INSERT INTO <tablename> (<column1 name>, <column2 name>...)
VALUES (<value1>, <value2>...) USING <option>

Example
Let us assume there is a table called emp with columns (emp_id, emp_name, emp_city, emp_phone, emp_sal) and you have to insert the following data into the emp table.

emp_id | emp_name | emp_city  | emp_phone  | emp_sal
-------+----------+-----------+------------+--------
1      | ram      | Hyderabad | 9848022338 | 50000
2      | robin    | Hyderabad | 9848022339 | 40000
3      | rahman   | Chennai   | 9848022330 | 45000

Use the commands given below to fill the table with the required data.

cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);
cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);
cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)
   VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);

Verification
After inserting data, use the SELECT statement to verify whether the data has been inserted or not. If you verify the emp table using the SELECT statement, it will give you the following output.

cqlsh:tutorialspoint> SELECT * FROM emp;

 emp_id | emp_city  | emp_name | emp_phone  | emp_sal
--------+-----------+----------+------------+---------
      1 | Hyderabad |      ram | 9848022338 |   50000
      2 | Hyderabad |    robin | 9848022339 |   40000
      3 |   Chennai |   rahman | 9848022330 |   45000

(3 rows)

Here you can observe that the table has been populated with the data we inserted.

Creating Data using Java API
You can create data in a table using the execute() method of the Session class. Follow the steps given below to create data in a table using the Java API.

Step 1: Create a Cluster Object
Create an instance of the Cluster.Builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, use the build() method of the Cluster.Builder class. The following code shows how to create a cluster object.

//Building a cluster
Cluster cluster = builder2.build();

You can also build a cluster object using a single line of code, as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object
Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, then you can set the session to it by passing the keyspace name in string format to this method, as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace called tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");

Step 3: Execute Query
You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method. Whatever you pass to this method in string format will be executed on cqlsh.
In the following example, we are inserting data into a table called emp. You have to store the query in a string variable and pass it to the execute() method as shown below.

String query1 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);";
String query2 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);";
String query3 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);";

session.execute(query1);
session.execute(query2);
session.execute(query3);

Given below is the complete program to insert data into a table in Cassandra using the Java API.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Create_Data {

   public static void main(String args[]) {

      //queries
      String query1 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (1, 'ram', 'Hyderabad', 9848022338, 50000);";
      String query2 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (2, 'robin', 'Hyderabad', 9848022339, 40000);";
      String query3 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
         + " VALUES (3, 'rahman', 'Chennai', 9848022330, 45000);";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the queries
      session.execute(query1);
      session.execute(query2);
      session.execute(query3);

      System.out.println("Data created");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Create_Data.java
$java Create_Data

Under normal conditions, it should produce the following output −

Data created
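As a quick sanity check, you can also read the rows back through the same Java driver instead of switching to cqlsh. The following is a minimal sketch that reuses the tp keyspace and emp table from above; ResultSet and Row come from the same com.datastax.driver.core package already used in this chapter.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class Read_Data {

   public static void main(String args[]) {

      //Connecting to the tp keyspace, as in the insert example
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("tp");

      //Running a SELECT and iterating over the returned rows
      ResultSet results = session.execute("SELECT * FROM emp;");
      for (Row row : results) {
         System.out.println(row.getInt("emp_id") + " | "
            + row.getString("emp_name") + " | "
            + row.getString("emp_city"));
      }

      cluster.close();
   }
}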
AWS Quicksight – Creating Story

A Story is an option with which you capture a series of screens and play them back one by one. For example, if you want to see a visual with different filter options, you can use a story. To create a Story, click Story on the leftmost panel. By default, there is a story named Storyboard 1. Now capture the screen using the capture icon at the top of the rightmost panel. Each capture of the screen is also referred to as a Scene. You can capture multiple scenes, and they will be added under "Storyboard 1". The data in the story is refreshed automatically once your main data source is refreshed.
Big Data Analytics – Home
Big Data Analytics Tutorial

Big Data, as its name implies, refers to data of very large size, and data sizes are increasing day by day. Individuals deal with data using mobile phones, tablets, and laptops, while organisations deal with business data; statistically, data volumes have increased drastically over the past decade.

What is Big Data?
The term "Big Data" usually refers to datasets that are too large and complex for ordinary data processing systems to manage efficiently. These datasets can be derived from a variety of sources, including social media, sensors, internet activity, and mobile devices. The data can be structured, semi-structured, or unstructured.

Big Data Analytics
Big Data Analytics is the process of analysing large and diverse data sets to discover hidden patterns, unknown relationships, market trends, user preferences, and other important information. It uses advanced analytics techniques such as statistical analysis, machine learning, data mining, and predictive modelling to extract insights from enormous datasets. Organisations across the world capture terabytes of data about their users' interactions, business, and social media, as well as data from sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture.

Where is Big Data Analytics Used?
Big Data Analytics strives to assist organisations in making more informed business decisions, increasing operational efficiency, improving customer experiences and services, and staying competitive within their respective industries. The Big Data Analytics process involves data gathering, storage, processing, analysis, and visualisation of outcomes to support strategic business decisions. The core of Big Data Analytics is the process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations. Overall, Big Data Analytics enables organizations to harness the vast amounts of data available to them and turn it into actionable insights that drive business growth and innovation. In this big data analytics tutorial, we will discuss the most fundamental concepts and methods of Big Data Analytics.

Audience
This tutorial has been prepared for software professionals aspiring to learn the basics of Big Data Analytics. Professionals who work in analytics in general may also use this tutorial to good effect.

Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to handling huge volumes of unprocessed data at an organizational level. Through this tutorial, we will develop a mini project to provide exposure to a real-world problem and how to solve it using Big Data Analytics.
Apache Solr – Home
Apache Solr Tutorial

Solr is a scalable, ready-to-deploy search/storage engine optimized to search large volumes of text-centric data. Solr is enterprise-ready, fast, and highly scalable. In this tutorial, we are going to learn the basics of Solr and how you can use it in practice.

Audience
This tutorial will be helpful for all those developers who would like to understand the basic functionalities of Apache Solr in order to develop sophisticated and high-performing applications.

Prerequisites
Before proceeding with this tutorial, we expect that the reader has good Java programming skills (although it is not mandatory) and some prior exposure to the Lucene and Hadoop environment.
Apache Solr – Overview
Solr is an open-source search platform which is used to build search applications. It was built on top of Lucene (a full-text search engine). Solr is enterprise-ready, fast, and highly scalable. The applications built using Solr are sophisticated and deliver high performance.

It was Yonik Seeley who created Solr in 2004 in order to add search capabilities to the company website of CNET Networks. In January 2006, it was made an open-source project under the Apache Software Foundation. Its latest version, Solr 6.0, was released in 2016 with support for the execution of parallel SQL queries.

Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. Not only for search, Solr can also be used for storage purposes. Like other NoSQL databases, it is a non-relational data storage and processing technology. In short, Solr is a scalable, ready-to-deploy search/storage engine optimized to search large volumes of text-centric data.

Features of Apache Solr
Solr is a wrapper around Lucene's Java API. Therefore, using Solr, you can leverage all the features of Lucene. Let us take a look at some of the most prominent features of Solr −

Restful APIs − To communicate with Solr, it is not mandatory to have Java programming skills. Instead, you can use RESTful services to communicate with it. We enter documents into Solr in file formats like XML, JSON, and CSV, and get results in the same formats.
Full text search − Solr provides all the capabilities needed for a full-text search, such as tokens, phrases, spell check, wildcards, and auto-complete.
Enterprise ready − According to the needs of the organization, Solr can be deployed on any kind of system (big or small), such as standalone, distributed, cloud, etc.
Flexible and Extensible − By extending the Java classes and configuring them accordingly, we can easily customize the components of Solr.
NoSQL database − Solr can also be used as a big-data-scale NoSQL database where we can distribute the search tasks along a cluster.
Admin Interface − Solr provides an easy-to-use, user-friendly, feature-rich user interface, using which we can perform all possible tasks such as managing logs and adding, deleting, updating, and searching documents.
Highly Scalable − While using Solr with Hadoop, we can scale its capacity by adding replicas.
Text-Centric and Sorted by Relevance − Solr is mostly used to search text documents, and the results are delivered in order of relevance to the user's query.

Unlike with Lucene, you don't need to have Java programming skills while working with Apache Solr. It provides a wonderful ready-to-deploy service to build a search box featuring autocomplete, which Lucene doesn't provide. Using Solr, we can scale, distribute, and manage indexes for large-scale (Big Data) applications.

Lucene in Search Applications
Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability. Lucene is a scalable and high-performance library used to index and search virtually any kind of text. The Lucene library provides the core operations which are required by any search application, such as indexing and searching. If we have a web portal with a huge volume of data, then we will most probably require a search engine in our portal to extract relevant information from the huge pool of data. Lucene works as the heart of any search application and provides the vital operations pertaining to indexing and searching.
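Although REST calls are enough to work with Solr, Java applications typically use the SolrJ client library. The following is a minimal sketch of indexing and querying a document with SolrJ, assuming SolrJ 6.x is on the classpath and a core named sample_core already exists; the core name and field names here are illustrative assumptions.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrJExample {

   public static void main(String[] args) throws Exception {

      // Client pointing at a core assumed to exist (sample_core is an illustrative name)
      HttpSolrClient client =
         new HttpSolrClient.Builder("http://localhost:8983/solr/sample_core").build();

      // Index one document; id and title are assumed to be defined in the core's schema
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      doc.addField("title", "Apache Solr is a search platform built on Lucene");
      client.add(doc);
      client.commit();

      // Query it back with a simple full-text search
      SolrQuery query = new SolrQuery("title:lucene");
      QueryResponse response = client.query(query);
      System.out.println("Documents found: " + response.getResults().getNumFound());

      client.close();
   }
}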
Apache Solr – On Hadoop
Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. In this section, let us understand how you can install Hadoop on your system.

Downloading Hadoop
Given below are the steps to be followed to download Hadoop onto your system.

Step 1 − Go to the homepage of Hadoop. You can use the link − www.hadoop.apache.org/. Click the link Releases, as highlighted in the following screenshot. It will redirect you to the Apache Hadoop Releases page, which contains links for mirrors of source and binary files of various versions of Hadoop as follows −

Step 2 − Select the latest version of Hadoop (in our tutorial, it is 2.6.4) and click its binary link. It will take you to a page where mirrors for the Hadoop binary are available. Click one of these mirrors to download Hadoop.

Download Hadoop from Command Prompt
Open the Linux terminal and log in as super-user.

$ su
password:

Go to the directory where you need to install Hadoop, and save the file there using the link copied earlier, as shown in the following code block.

# cd /usr/local
# wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

After downloading Hadoop, extract it using the following commands.

# tar zxvf hadoop-2.6.4.tar.gz
# mkdir hadoop
# mv hadoop-2.6.4/* hadoop/
# exit

Installing Hadoop
Follow the steps given below to install Hadoop in pseudo-distributed mode.

Step 1: Setting Up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Next, apply all the changes to the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop −

core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml

core-site.xml
The core-site.xml file contains information such as the port number used for the Hadoop instance, memory allocated for the file system, the memory limit for storing the data, and the size of Read/Write buffers. Open core-site.xml and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode paths of your local file systems, i.e. the place where you want to store the Hadoop infrastructure. Let us assume the following data.

dfs.replication (data replication value) = 1

(In the path given below, /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml
This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template named mapred-site.xml.template. First of all, you need to copy the file mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.

$ cd ~
$ hdfs namenode -format

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.4
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain
1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2: Verifying the Hadoop dfs
The following command is used to start the Hadoop dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows −

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying the Yarn Script
The following command is used to start the Yarn script. Executing this command will start your Yarn daemons.
$ start-yarn.sh

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on a Browser
The default port number to access Hadoop is 50070. Use the following URL to access the Hadoop services in your browser.

http://localhost:50070/

Installing Solr on Hadoop
Follow the steps given below to download and install Solr.

Step 1
Open the homepage of Apache Solr by clicking the following link − https://lucene.apache.org/solr/

Step 2
Click the download button (highlighted in the above screenshot). On clicking, you will be redirected to the page where you have various mirrors of Apache Solr. Select a mirror and click on it, which will redirect you to a page where
Apache Solr – Core
A Solr Core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. We need to create a Solr Core to perform operations like indexing and analyzing. A Solr application may contain one or multiple cores. If necessary, two cores in a Solr application can communicate with each other.

Creating a Core
After installing and starting Solr, you can connect to the client (web interface) of Solr. As highlighted in the following screenshot, initially there are no cores in Apache Solr. Now, we will see how to create a core in Solr.

Using the create command
One way to create a core is to create a schema-less core using the create command, as shown below −

[Hadoop@localhost bin]$ ./solr create -c Solr_sample

Here, we are trying to create a core named Solr_sample in Apache Solr. This command creates the core and displays the following message.

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/solr/Solr_sample

Creating new core 'Solr_sample' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=Solr_sample&instanceDir=Solr_sample

{
   "responseHeader":{
      "status":0,
      "QTime":11550
   },
   "core":"Solr_sample"
}

You can create multiple cores in Solr. On the left-hand side of the Solr Admin, you can see a core selector where you can select the newly created core, as shown in the following screenshot.

Using the create_core command
Alternatively, you can create a core using the create_core command. This command has the following options −

-c core_name    Name of the core you want to create
-p port_name    Port at which you want to create the core
-d conf_dir     Configuration directory to use for the core

Let us see how you can use the create_core command. Here, we will try to create a core named my_core.

[Hadoop@localhost bin]$ ./solr create_core -c my_core

On executing, the above command creates a core and displays the following message −

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/solr/my_core

Creating new core 'my_core' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&instanceDir=my_core

{
   "responseHeader":{
      "status":0,
      "QTime":1346
   },
   "core":"my_core"
}

Deleting a Core
You can delete a core using the delete command of Apache Solr. Let us suppose we have a core named my_core in Solr, as shown in the following screenshot. You can delete this core by passing its name to the delete command as follows −

[Hadoop@localhost bin]$ ./solr delete -c my_core

On executing the above command, the specified core will be deleted, displaying the following message.

Deleting core 'my_core' using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=my_core&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true

{
   "responseHeader":{
      "status":0,
      "QTime":170
   }
}

You can open the web interface of Solr to verify whether the core has been deleted or not.
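The ./solr commands above ultimately call the CoreAdmin HTTP API shown in the output messages. From Java, the same administration calls can be issued through SolrJ. The following is a minimal sketch, assuming SolrJ is on the classpath and Solr is running on the default port; the core name and instance directory are illustrative, and the instance directory is assumed to already contain a valid conf/ directory.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CoreAdminExample {

   public static void main(String[] args) throws Exception {

      // Client pointing at the Solr base URL (not at a particular core)
      HttpSolrClient client =
         new HttpSolrClient.Builder("http://localhost:8983/solr").build();

      // Create a core; the instance directory is assumed to hold the core configuration
      CoreAdminRequest.createCore("my_core", "my_core", client);
      System.out.println("Core created");

      // Unload (remove) the core again, mirroring the delete command above
      CoreAdminRequest.unloadCore("my_core", client);
      System.out.println("Core unloaded");

      client.close();
   }
}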