Apache Solr Tutorial

Solr is a scalable, ready-to-deploy search/storage engine optimized for searching large volumes of text-centric data. Solr is enterprise-ready, fast, and highly scalable. In this tutorial, we are going to learn the basics of Solr and how you can use it in practice.

Audience

This tutorial will be helpful for all those developers who would like to understand the basic functionalities of Apache Solr in order to develop sophisticated and high-performing applications.

Prerequisites

Before proceeding with this tutorial, we expect that the reader has good Java programming skills (although it is not mandatory) and some prior exposure to the Lucene and Hadoop environment.
Apache Solr – Overview
Solr is an open-source search platform that is used to build search applications. It was built on top of Lucene (a full-text search engine library). Solr is enterprise-ready, fast, and highly scalable. The applications built using Solr are sophisticated and deliver high performance.

It was Yonik Seeley who created Solr in 2004 in order to add search capabilities to the company website of CNET Networks. In January 2006, it was made an open-source project under the Apache Software Foundation. Its latest version, Solr 6.0, was released in 2016 with support for the execution of parallel SQL queries.

Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. Not only search, Solr can also be used for storage purposes. Like other NoSQL databases, it is a non-relational data storage and processing technology.

In short, Solr is a scalable, ready-to-deploy search/storage engine optimized to search large volumes of text-centric data.

Features of Apache Solr

Solr is a wrapper around Lucene's Java API. Therefore, using Solr, you can leverage all the features of Lucene. Let us take a look at some of the most prominent features of Solr −

RESTful APIs − To communicate with Solr, it is not mandatory to have Java programming skills. Instead, you can use RESTful services to communicate with it. We send documents to Solr in file formats like XML, JSON, and CSV and get results in the same formats (see the sketch at the end of this chapter).

Full-text search − Solr provides all the capabilities needed for a full-text search such as tokens, phrases, spell check, wildcards, and auto-complete.

Enterprise ready − According to the needs of the organization, Solr can be deployed in any kind of system (big or small), such as standalone, distributed, cloud, etc.

Flexible and Extensible − By extending the Java classes and configuring them accordingly, we can customize the components of Solr easily.

NoSQL database − Solr can also be used as a big data-scale NoSQL database where we can distribute the search tasks along a cluster.

Admin Interface − Solr provides an easy-to-use, user-friendly, feature-rich user interface, using which we can perform all the possible tasks such as managing logs and adding, deleting, updating, and searching documents.

Highly Scalable − While using Solr with Hadoop, we can scale its capacity by adding replicas.

Text-Centric and Sorted by Relevance − Solr is mostly used to search text documents, and the results are delivered in order of relevance to the user's query.

Unlike Lucene, you don't need to have Java programming skills while working with Apache Solr. It provides a wonderful ready-to-deploy service to build a search box featuring autocomplete, which Lucene doesn't provide. Using Solr, we can scale, distribute, and manage indexes for large-scale (Big Data) applications.

Lucene in Search Applications

Lucene is a simple yet powerful Java-based search library. It can be used in any application to add search capability. Lucene is a scalable and high-performance library used to index and search virtually any kind of text. The Lucene library provides the core operations which are required by any search application, such as indexing and searching.

If we have a web portal with a huge volume of data, then we will most probably require a search engine in our portal to extract relevant information from the huge pool of data. Lucene works as the heart of any search application and provides the vital operations pertaining to indexing and searching.
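Since Solr exposes everything over HTTP, any language that can make an HTTP request can talk to it. The following is a minimal sketch (not part of the original tutorial) showing what such a RESTful call looks like from plain Java, without the SolrJ client. The class name SolrRestQuery, the core name my_core, the default port 8983, and the /select parameters are assumptions; adjust them to match your own installation.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical example class; queries a Solr core over its REST interface.
public class SolrRestQuery {
   public static void main(String[] args) throws Exception {
      // The /select handler returns matching documents; q=*:* matches every
      // document and wt=json asks for the response in JSON format.
      URL url = new URL("http://localhost:8983/solr/my_core/select?q=*:*&wt=json");
      HttpURLConnection connection = (HttpURLConnection) url.openConnection();
      connection.setRequestMethod("GET");

      // Read and print the raw JSON response returned by Solr.
      try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(connection.getInputStream()))) {
         String line;
         while ((line = reader.readLine()) != null) {
            System.out.println(line);
         }
      }
   }
}

The same request can be issued with curl or from a browser address bar, which is what makes Solr usable from non-Java environments.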
Apache Solr – On Hadoop
Solr can be used along with Hadoop. As Hadoop handles a large amount of data, Solr helps us in finding the required information from such a large source. In this section, let us understand how you can install Hadoop on your system.

Downloading Hadoop

Given below are the steps to be followed to download Hadoop onto your system.

Step 1 − Go to the homepage of Hadoop. You can use the link − www.hadoop.apache.org/. Click the link Releases, as highlighted in the following screenshot.

It will redirect you to the Apache Hadoop Releases page, which contains links for mirrors of source and binary files of various versions of Hadoop as follows −

Step 2 − Select the latest version of Hadoop (in our tutorial, it is 2.6.4) and click its binary link. It will take you to a page where mirrors for the Hadoop binary are available. Click one of these mirrors to download Hadoop.

Download Hadoop from Command Prompt

Open a Linux terminal and log in as super-user.

$ su
password:

Go to the directory where you need to install Hadoop, and save the file there using the link copied earlier, as shown in the following code block.

# cd /usr/local
# wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz

After downloading Hadoop, extract it using the following commands.

# tar zxvf hadoop-2.6.4.tar.gz
# mkdir hadoop
# mv hadoop-2.6.4/* hadoop/
# exit

Installing Hadoop

Follow the steps given below to install Hadoop in pseudo-distributed mode.

Step 1: Setting Up Hadoop

You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Next, apply all the changes to the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration

You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you have to reset the Java environment variable in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java on your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

The following is the list of files that you have to edit to configure Hadoop −

core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml

core-site.xml

The core-site.xml file contains information such as the port number used for the Hadoop instance, the memory allocated for the file system, the memory limit for storing the data, and the size of the Read/Write buffers.

Open core-site.xml and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, the namenode path, and the datanode paths of your local file systems, that is, the places where you want to store the Hadoop infrastructure.

Let us assume the following data.

dfs.replication (data replication value) = 1

(In the paths given below, /hadoop/ is the user name, and hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = /home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)

datanode path = /home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note − In the above file, all the property values are user-defined and you can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN in Hadoop. Open the yarn-site.xml file and add the following properties between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains only a template of this file, so first you need to copy mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open the mapred-site.xml file and add the following properties inside the <configuration>, </configuration> tags.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

Set up the namenode using the command "hdfs namenode -format" as follows.

$ cd ~
$ hdfs namenode -format

The expected result is as follows.

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.4
…
…
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2: Verifying the Hadoop dfs

The following command is used to start the Hadoop dfs. Executing this command will start your Hadoop file system.

$ start-dfs.sh

The expected output is as follows −

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.6.4/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying the Yarn Script

The following command is used to start the YARN script. Executing this command will start your YARN daemons.
$ start-yarn.sh

The expected output is as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.4/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop services in the browser.

http://localhost:50070/

Installing Solr on Hadoop

Follow the steps given below to download and install Solr.

Step 1

Open the homepage of Apache Solr by clicking the following link − https://lucene.apache.org/solr/

Step 2

Click the download button (highlighted in the above screenshot). On clicking, you will be redirected to a page where you have various mirrors of Apache Solr. Select a mirror and click on it, which will redirect you to a page where you can download the source and binary files of Apache Solr.
Apache Solr – Core
A Solr Core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. We need to create a Solr Core to perform operations like indexing and analyzing. A Solr application may contain one or multiple cores. If necessary, two cores in a Solr application can communicate with each other.

Creating a Core

After installing and starting Solr, you can connect to the client (web interface) of Solr. As highlighted in the following screenshot, initially there are no cores in Apache Solr. Now, we will see how to create a core in Solr.

Using the create command

One way to create a core is to create a schema-less core using the create command, as shown below −

[Hadoop@localhost bin]$ ./solr create -c Solr_sample

Here, we are trying to create a core named Solr_sample in Apache Solr. This command creates a core, displaying the following message.

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/solr/Solr_sample

Creating new core "Solr_sample" using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=Solr_sample&instanceDir=Solr_sample

{
   "responseHeader":{
      "status":0,
      "QTime":11550
   },
   "core":"Solr_sample"
}

You can create multiple cores in Solr. On the left-hand side of the Solr Admin, you can see a core selector where you can select the newly created core, as shown in the following screenshot.

Using the create_core command

Alternatively, you can create a core using the create_core command. This command has the following options −

-c core_name − Name of the core you want to create
-p port_name − Port at which you want to create the core
-d conf_dir − Configuration directory to be used for the core

Let's see how you can use the create_core command. Here, we will try to create a core named my_core.

[Hadoop@localhost bin]$ ./solr create_core -c my_core

On executing, the above command creates a core, displaying the following message −

Copying configuration to new core instance directory:
/home/Hadoop/Solr/server/solr/my_core

Creating new core "my_core" using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=my_core&instanceDir=my_core

{
   "responseHeader":{
      "status":0,
      "QTime":1346
   },
   "core":"my_core"
}

Deleting a Core

You can delete a core using the delete command of Apache Solr. Let's suppose we have a core named my_core in Solr, as shown in the following screenshot. You can delete this core using the delete command by passing the name of the core to this command as follows −

[Hadoop@localhost bin]$ ./solr delete -c my_core

On executing the above command, the specified core will be deleted, displaying the following message.

Deleting core "my_core" using command:
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=my_core&deleteIndex=true&deleteDataDir=true&deleteInstanceDir=true

{
   "responseHeader":{
      "status":0,
      "QTime":170
   }
}

You can open the web interface of Solr to verify whether the core has been deleted or not.
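Cores can also be administered programmatically through SolrJ's CoreAdminRequest class instead of the bin scripts. The sketch below is an illustration under assumptions rather than part of the original tutorial: it expects SolrJ (6.x) on the classpath, Solr running on the default port, and, unlike bin/solr create, an instance directory named my_core with its conf directory already present on the server. The class name CoreAdminExample is hypothetical.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

// Hypothetical example class; creates and then unloads a core via the CoreAdmin API.
public class CoreAdminExample {
   public static void main(String[] args) throws Exception {
      // Connect to the Solr node itself (not to a particular core),
      // because core administration happens at the node level.
      SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();

      // Create a core named my_core using the instance directory my_core
      // (the directory and its configuration must already exist on the server).
      CoreAdminRequest.createCore("my_core", "my_core", client);
      System.out.println("Core created");

      // Unload (remove) the same core again.
      CoreAdminRequest.unloadCore("my_core", client);
      System.out.println("Core unloaded");

      client.close();
   }
}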
Apache Solr – Deleting Documents

Deleting the Document

To delete documents from the index of Apache Solr, we need to specify the IDs of the documents to be deleted between the <delete></delete> tags.

<delete>
   <id>003</id>
   <id>005</id>
   <id>004</id>
   <id>002</id>
</delete>

Here, this XML code is used to delete the documents with IDs 002, 003, 004, and 005. Save this code in a file with the name delete.xml.

If you want to delete the documents from the index which belong to the core named my_core, then you can post the delete.xml file using the post tool, as shown below.

[Hadoop@localhost bin]$ ./post -c my_core delete.xml

On executing the above command, you will get the following output.

/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar
-Dauto=yes -Dc=my_core -Ddata=files org.apache.solr.util.SimplePostTool delete.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/my_core/update…
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file delete.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/my_core/update…
Time spent: 0:00:00.179

Verification

Visit the homepage of the Apache Solr web interface and select the core my_core. Try to retrieve all the documents by passing the query "*:*" in the text area q and execute the query. On executing, you can observe that the specified documents are deleted.

Deleting a Field

Sometimes we need to delete documents based on fields other than ID. For example, we may have to delete the documents where the city is Chennai. In such cases, you need to specify the name and value of the field within the <query></query> tag pair.

<delete>
   <query>city:Chennai</query>
</delete>

Save it as delete_field.xml and perform the delete operation on the core named my_core using the post tool of Solr.

[Hadoop@localhost bin]$ ./post -c my_core delete_field.xml

On executing the above command, it produces the following output.

/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar
-Dauto=yes -Dc=my_core -Ddata=files org.apache.solr.util.SimplePostTool delete_field.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/my_core/update…
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file delete_field.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/my_core/update…
Time spent: 0:00:00.084

Verification

Visit the homepage of the Apache Solr web interface and select the core my_core. Try to retrieve all the documents by passing the query "*:*" in the text area q and execute the query. On executing, you can observe that the documents containing the specified field-value pair are deleted.

Deleting All Documents

Just like deleting documents by a specific field, if you want to delete all the documents from an index, you just need to pass the query "*:*" between the tags <query></query>, as shown below.

<delete>
   <query>*:*</query>
</delete>

Save it as delete_all.xml and perform the delete operation on the core named my_core using the post tool of Solr.

[Hadoop@localhost bin]$ ./post -c my_core delete_all.xml

On executing the above command, it produces the following output.
/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar
-Dauto=yes -Dc=my_core -Ddata=files org.apache.solr.util.SimplePostTool delete_all.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/my_core/update…
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file delete_all.xml (application/xml) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/my_core/update…
Time spent: 0:00:00.138

Verification

Visit the homepage of the Apache Solr web interface and select the core my_core. Try to retrieve all the documents by passing the query "*:*" in the text area q and execute the query. On executing, you can observe that all the documents in the index are deleted.

Deleting all the documents using Java (Client API)

Following is the Java program to delete all the documents from the Apache Solr index. Save this code in a file with the name DeletingAllDocuments.java.

import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeletingAllDocuments {
   public static void main(String args[]) throws SolrServerException, IOException {

      //Preparing the Solr client
      String urlString = "http://localhost:8983/solr/my_core";
      SolrClient solr = new HttpSolrClient.Builder(urlString).build();

      //Deleting all the documents from Solr
      solr.deleteByQuery("*:*");

      //Saving the changes
      solr.commit();
      System.out.println("Documents deleted");
   }
}

Compile and run the above code by executing the following commands in the terminal −

[Hadoop@localhost bin]$ javac DeletingAllDocuments.java
[Hadoop@localhost bin]$ java DeletingAllDocuments

On executing the above command, you will get the following output.

Documents deleted
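The selective deletions shown earlier with XML files can also be performed directly from the Java client. The following is a minimal SolrJ sketch (not from the original tutorial) that assumes the core my_core and reuses the IDs and the city field from the examples above; the class name DeletingSelectedDocuments is hypothetical.

import java.util.Arrays;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Hypothetical example class; deletes documents by ID and by query.
public class DeletingSelectedDocuments {
   public static void main(String[] args) throws Exception {
      //Preparing the Solr client
      String urlString = "http://localhost:8983/solr/my_core";
      SolrClient solr = new HttpSolrClient.Builder(urlString).build();

      // Equivalent of <delete><id>...</id></delete>: delete by unique key.
      solr.deleteById(Arrays.asList("003", "005"));

      // Equivalent of <delete><query>city:Chennai</query></delete>.
      solr.deleteByQuery("city:Chennai");

      // Commit so the deletions become visible to searches.
      solr.commit();
      System.out.println("Selected documents deleted");

      solr.close();
   }
}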
Apache Solr – Indexing Data
In general, indexing is a systematic arrangement of documents (or other entities). Indexing enables users to locate information in a document. Indexing collects, parses, and stores documents. Indexing is done to increase the speed and performance of a search query while finding a required document.

Indexing in Apache Solr

In Apache Solr, we can index (add, delete, modify) various document formats such as XML, CSV, PDF, etc. We can add data to the Solr index in several ways −

Using the Solr web interface.
Using any of the client APIs like Java, Python, etc.
Using the post tool.

In this chapter, we will discuss how to add data to the index of Apache Solr using these interfaces (command line, web interface, and Java client API).

Adding Documents using the Post Command

Solr has a post command in its bin/ directory. Using this command, you can index various formats of files such as JSON, XML, and CSV in Apache Solr.

Browse through the bin directory of Apache Solr and execute the -h option of the post command, as shown in the following code block.

[Hadoop@localhost bin]$ cd $SOLR_HOME
[Hadoop@localhost bin]$ ./post -h

On executing the above command, you will get a list of options of the post command, as shown below.

Usage: post -c <collection> [OPTIONS] <files|directories|urls|-d ["..."]>
or post -help

collection name defaults to DEFAULT_SOLR_COLLECTION if not specified

OPTIONS
=======
Solr options:
   -url <base Solr update URL> (overrides collection, host, and port)
   -host <host> (default: localhost)
   -p or -port <port> (default: 8983)
   -commit yes|no (default: yes)

Web crawl options:
   -recursive <depth> (default: 1)
   -delay <seconds> (default: 10)

Directory crawl options:
   -delay <seconds> (default: 0)

stdin/args options:
   -type <content/type> (default: application/xml)

Other options:
   -filetypes <type>[,<type>,…] (default:
      xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)
   -params "<key>=<value>[&<key>=<value>…]" (values must be URL-encoded; these pass through to Solr update request)
   -out yes|no (default: no; yes outputs Solr response to console)
   -format solr (sends application/json content as Solr commands to /update instead of /update/json/docs)

Examples:
* JSON file: ./post -c wizbang events.json
* XML files: ./post -c records article*.xml
* CSV file: ./post -c signals LATEST-signals.csv
* Directory of files: ./post -c myfiles ~/Documents
* Web crawl: ./post -c gettingstarted http://lucene.apache.org/solr -recursive 1 -delay 1
* Standard input (stdin): echo '{commit: {}}' | ./post -c my_collection -type application/json -out yes -d
* Data as string: ./post -c signals -type text/csv -out yes -d $'id,value\n1,0.47'

Example

Suppose we have a file named sample.csv with the following content (in the bin directory).

Student ID | First Name | Last Name    | Phone      | City
001        | Rajiv      | Reddy        | 9848022337 | Hyderabad
002        | Siddharth  | Bhattacharya | 9848022338 | Kolkata
003        | Rajesh     | Khanna       | 9848022339 | Delhi
004        | Preethi    | Agarwal      | 9848022330 | Pune
005        | Trupthi    | Mohanty      | 9848022336 | Bhubaneshwar
006        | Archana    | Mishra       | 9848022335 | Chennai

The above dataset contains personal details like student ID, first name, last name, phone, and city. The CSV file of the dataset is shown below. Here, you must note that the first line of the file documents the schema (the field names).
id, first_name, last_name, phone_no, location
001, Pruthvi, Reddy, 9848022337, Hyderabad
002, kasyap, Sastry, 9848022338, Vishakapatnam
003, Rajesh, Khanna, 9848022339, Delhi
004, Preethi, Agarwal, 9848022330, Pune
005, Trupthi, Mohanty, 9848022336, Bhubaneshwar
006, Archana, Mishra, 9848022335, Chennai

You can index this data under the core named Solr_sample using the post command as follows −

[Hadoop@localhost bin]$ ./post -c Solr_sample sample.csv

On executing the above command, the given document is indexed under the specified core, generating the following output.

/home/Hadoop/java/bin/java -classpath /home/Hadoop/Solr/dist/solr-core-6.2.0.jar
-Dauto=yes -Dc=Solr_sample -Ddata=files org.apache.solr.util.SimplePostTool sample.csv
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/Solr_sample/update…
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file sample.csv (text/csv) to [base]
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/Solr_sample/update…
Time spent: 0:00:00.228

Visit the homepage of the Solr web UI using the following URL −

http://localhost:8983/

Select the core Solr_sample. By default, the request handler is /select and the query is "*:*". Without doing any modifications, click the Execute Query button at the bottom of the page.

On executing the query, you can observe the contents of the indexed CSV document in JSON format (the default), as shown in the following screenshot.

Note − In the same way, you can index other file formats such as JSON, XML, CSV, etc.

Adding Documents using the Solr Web Interface

You can also index documents using the web interface provided by Solr. Let us see how to index the following JSON document.

[
   {
      "id" : "001",
      "name" : "Ram",
      "age" : 53,
      "Designation" : "Manager",
      "Location" : "Hyderabad"
   },
   {
      "id" : "002",
      "name" : "Robert",
      "age" : 43,
      "Designation" : "SR.Programmer",
      "Location" : "Chennai"
   },
   {
      "id" : "003",
      "name" : "Rahim",
      "age" : 25,
      "Designation" : "JR.Programmer",
      "Location" : "Delhi"
   }
]

Step 1

Open the Solr web interface using the following URL −

http://localhost:8983/

Step 2

Select the core Solr_sample. By default, the values of the fields Request-Handler, Commit Within, Overwrite, and Boost are /update, 1000, true, and 1.0 respectively, as shown in the following screenshot.

Now, choose the document format you want from JSON, CSV, XML, etc. Type the document to be indexed in the text area and click the Submit Document button, as shown in the following screenshot.

Adding Documents using the Java Client API

Following is the Java program to add documents to the Apache Solr index. Save this code in a file with the name AddingDocument.java.
import java.io.IOException;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddingDocument {
   public static void main(String args[]) throws Exception {

      //Preparing the Solr client
      String urlString = "http://localhost:8983/solr/my_core";
      SolrClient solr = new HttpSolrClient.Builder(urlString).build();

      //Preparing the Solr document
      SolrInputDocument doc = new SolrInputDocument();

      //Adding fields to the document
      doc.addField("id", "003");
      doc.addField("name", "Rajaman");
      doc.addField("age", "34");
      doc.addField("addr", "vishakapatnam");

      //Adding the document to Solr
      solr.add(doc);

      //Saving the changes
      solr.commit();
      System.out.println("Documents added");
   }
}

Compile and run the above code by executing the following commands in the terminal −

[Hadoop@localhost bin]$ javac AddingDocument.java
[Hadoop@localhost bin]$ java AddingDocument
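When you have several documents to index, SolrJ also accepts a whole collection in a single add() call, which avoids one round trip per document. The sketch below is an illustrative assumption (not part of the original tutorial): it reuses the Solr_sample core and the field names from the sample.csv example above, and the class name AddingMultipleDocuments is hypothetical.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical example class; indexes a small batch of documents in one request.
public class AddingMultipleDocuments {
   public static void main(String[] args) throws Exception {
      SolrClient solr = new HttpSolrClient.Builder(
         "http://localhost:8983/solr/Solr_sample").build();

      //Building a small batch of documents in memory
      List<SolrInputDocument> docs = new ArrayList<>();
      String[][] rows = {
         {"001", "Rajiv", "Reddy", "9848022337", "Hyderabad"},
         {"002", "Siddharth", "Bhattacharya", "9848022338", "Kolkata"}
      };
      for (String[] row : rows) {
         SolrInputDocument doc = new SolrInputDocument();
         doc.addField("id", row[0]);
         doc.addField("first_name", row[1]);
         doc.addField("last_name", row[2]);
         doc.addField("phone_no", row[3]);
         doc.addField("location", row[4]);
         docs.add(doc);
      }

      //A single add() call sends the whole batch to Solr
      solr.add(docs);
      solr.commit();
      System.out.println("Documents added: " + docs.size());

      solr.close();
   }
}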
AWS Quicksight – Quick Guide
AWS Quicksight – Overview

AWS Quicksight is one of the most powerful Business Intelligence tools, which allows you to create interactive dashboards within minutes to provide business insights to organizations. There are a number of visualizations or graphical formats in which the dashboards can be created. The dashboards are automatically updated as their data is updated or on a schedule. You can also embed a dashboard created in Quicksight into your web application.

With ML Insights (Machine Learning insights), Quicksight uses its built-in algorithms to find anomalies or peaks in the historical data. This helps you prepare for business requirements ahead of time based on these insights.

Here is a quick guide to get started with Quicksight. Below is the official product description page from AWS −

https://aws.amazon.com/quicksight/

You can also subscribe for an AWS trial account by filling in the required information and clicking the Continue button.

AWS Quicksight – Landing Page

To access the AWS Quicksight tool, you can open it directly by passing this URL in a web browser or by navigating to AWS Console → Services −

https://aws.amazon.com/quicksight/

Once you open this URL, click on "Sign in to the Console" at the top right corner. You need to provide the following details to log in to the Quicksight tool −

Account ID or alias
IAM user name
Password

Once you log in to Quicksight, you will see the following screen −

As marked in the above image,

Section A − The "New Analysis" icon is used to create a new analysis. When you click on it, it asks you to select a data set. You can also create a new data set as shown below −

Section B − The "Manage data" icon shows all the data sets that have already been input to Quicksight. This option can be used to manage a dataset without creating any analysis.

Section C − This shows the various data sources you have already connected to. You can also connect to a new data source or upload a file.

Section D − This section contains icons for already created analyses, published dashboards, and tutorial videos explaining Quicksight in detail. You can click on each tab to view them as below −

All analyses

Here, you can see all the existing analyses in the AWS Quicksight account, including reports and dashboards.

All dashboards

This option shows only the existing dashboards in the AWS Quicksight account.

Tutorial videos

Another option to open the Quicksight console is to navigate to the AWS console using the below URL −

https://aws.amazon.com/console/

Once you log in, navigate to the Services tab and search for Quicksight in the search bar. If you have recently used Quicksight services in the AWS account, it will be seen under the History tab.

AWS Quicksight – Using Data Sources

AWS Quicksight accepts data from various sources. Once you click on "New Dataset" on the home page, it gives you options of all the data sources that can be used. Below are the sources, containing the list of all internal and external sources −

Let us go through connecting Quicksight with some of the most commonly used data sources −

Uploading a file from the system

It allows you to input .csv, .tsv, .clf, .elf, .xlsx, and JSON format files only. Once you select the file, Quicksight automatically recognizes the file and displays the data. When you click on the Upload a File button, you need to provide the location of the file which you want to use to create the dataset.

Using a file from S3

The screen will appear as below.
Under Data source name, you can enter the name to be displayed for the data set that would be created. Also, you are required to either upload a manifest file from your local system or provide the S3 location of the manifest file.

A manifest file is a JSON-format file which specifies the URL/location of the input files and their format. You can enter more than one input file, provided the format is the same. Here is an example of a manifest file. The "URIs" parameter is used to pass the S3 locations of the input files.

{
   "fileLocations": [
      {
         "URIs": [
            "url of first file",
            "url of second file",
            "url of 3rd file and so on"
         ]
      }
   ],
   "globalUploadSettings": {
      "format": "CSV",
      "delimiter": ",",
      "textqualifier": "\"",
      "containsHeader": "true"
   }
}

The parameters passed in globalUploadSettings are the default ones. You can change these parameters as per your requirements.

MySQL

You need to enter the database information in the fields to connect to your database. Once it is connected to your database, you can import the data from it.

The following information is required when you connect to any RDBMS database −

DSN name
Type of connection
Database server name
Port
Database name
User name
Password

The following RDBMS-based data sources are supported in Quicksight −

Amazon Athena
Amazon Aurora
Amazon Redshift
Amazon Redshift Spectrum
Amazon S3
Amazon S3 Analytics
Apache Spark 2.0 or later
MariaDB 10.0 or later
Microsoft SQL Server 2012 or later
MySQL 5.1 or later
PostgreSQL 9.3.1 or later
Presto 0.167 or later
Snowflake
Teradata 14.0 or later

Athena

Athena is the AWS tool used to run queries on tables. You can choose any table from Athena or run a custom query on those tables and use the output of those queries in Quicksight. There are a couple of steps to choose a data source.

When you choose Athena, the following screen appears. You can input any data source name which you want to give to your data source in Quicksight. Click on "Validate Connection". Once the connection is validated, click on the "Create new source" button.

Now choose the table name from the dropdown. The dropdown will show the databases present in Athena, which will further show the tables in those databases. Otherwise, you can click on "Use custom SQL" to run a query on the Athena tables.

Once done, you can click on "Edit/Preview data" or "Visualize" to either edit your data or directly visualize it.
Cassandra – Create Table
Creating a Table

You can create a table using the command CREATE TABLE. Given below is the syntax for creating a table.

Syntax

CREATE (TABLE | COLUMNFAMILY) <tablename>
('<column-definition>', '<column-definition>')
(WITH <option> AND <option>)

Defining a Column

You can define a column as shown below.

column name1 data type,
column name2 data type,

example:

age int,
name text

Primary Key

The primary key is a column that is used to uniquely identify a row. Therefore, defining a primary key is mandatory while creating a table. A primary key is made of one or more columns of a table. You can define the primary key of a table as shown below.

CREATE TABLE tablename(
   column1 name datatype PRIMARY KEY,
   column2 name data type,
   column3 name data type
)

or

CREATE TABLE tablename(
   column1 name datatype,
   column2 name data type,
   column3 name data type,
   PRIMARY KEY (column1)
)

Example

Given below is an example to create a table in Cassandra using cqlsh. Here we are −

Using the keyspace tutorialspoint
Creating a table named emp

It will have details such as employee name, id, city, salary, and phone number. The employee id is the primary key.

cqlsh> USE tutorialspoint;
cqlsh:tutorialspoint> CREATE TABLE emp(
   emp_id int PRIMARY KEY,
   emp_name text,
   emp_city text,
   emp_sal varint,
   emp_phone varint
);

Verification

The SELECT statement will show you the columns of the table. Verify the table using the SELECT statement as shown below.

cqlsh:tutorialspoint> select * from emp;

 emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------

(0 rows)

Here you can observe that the table has been created with the given columns.

Creating a Table using Java API

You can create a table using the execute() method of the Session class. Follow the steps given below to create a table using the Java API.

Step 1: Create a Cluster Object

First of all, create an instance of the Cluster.builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, you have a method called build() in the Cluster.Builder class. The following code shows how to create a cluster object.

//Building a cluster
Cluster cluster = builder2.build();

You can also build a cluster object using a single line of code as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object

Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, you can set it to the existing one by passing the keyspace name in string format to this method, as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace named tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");

Step 3: Execute Query

You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method.
Whatever you pass to this method in string format will be executed on cqlsh. In the following example, we are creating a table named emp. You have to store the query in a string variable and pass it to the execute() method as shown below.

//Query
String query = "CREATE TABLE emp(emp_id int PRIMARY KEY, "
   + "emp_name text, "
   + "emp_city text, "
   + "emp_sal varint, "
   + "emp_phone varint );";
session.execute(query);

Given below is the complete program to create a table in Cassandra using the Java API.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Create_Table {

   public static void main(String args[]){

      //Query
      String query = "CREATE TABLE emp(emp_id int PRIMARY KEY, "
         + "emp_name text, "
         + "emp_city text, "
         + "emp_sal varint, "
         + "emp_phone varint );";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the query
      session.execute(query);

      System.out.println("Table created");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Create_Table.java
$java Create_Table

Under normal conditions, it should produce the following output −

Table created
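The verification done above with cqlsh can also be performed from Java using the same driver classes. The following is a small sketch (not part of the original tutorial) that assumes the keyspace tp and the emp table created above; the class name Verify_Table is hypothetical.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

// Hypothetical example class; prints the columns and rows of the emp table.
public class Verify_Table {
   public static void main(String args[]){

      //Connecting to the cluster and the keyspace used above
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("tp");

      //Running the same verification query used in cqlsh
      ResultSet result = session.execute("SELECT * FROM emp;");

      //Printing the column definitions of the newly created table
      System.out.println(result.getColumnDefinitions());

      //Printing any existing rows (none right after creation)
      for (Row row : result) {
         System.out.println(row);
      }

      cluster.close();
   }
}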
Cassandra – Update Data
Updating Data in a Table

UPDATE is the command used to update data in a table. The following keywords are used while updating data in a table −

Where − This clause is used to select the row to be updated.
Set − Set the value using this keyword.
Note − The WHERE clause must include all the columns composing the primary key.

While updating rows, if a given row is unavailable, then UPDATE creates a fresh row.

Given below is the syntax of the UPDATE command −

UPDATE <tablename>
SET <column name> = <new value>,
<column name> = <value> ...
WHERE <condition>

Example

Assume there is a table named emp. This table stores the details of the employees of a certain company, and it has the following details −

emp_id | emp_name | emp_city  | emp_phone  | emp_sal
-------+----------+-----------+------------+--------
1      | ram      | Hyderabad | 9848022338 | 50000
2      | robin    | Hyderabad | 9848022339 | 40000
3      | rahman   | Chennai   | 9848022330 | 45000

Let us now update the emp_city of robin to Delhi, and his salary to 50000. Given below is the query to perform the required updates.

cqlsh:tutorialspoint> UPDATE emp SET emp_city='Delhi', emp_sal=50000 WHERE emp_id=2;

Verification

Use the SELECT statement to verify whether the data has been updated or not. If you verify the emp table using the SELECT statement, it will produce the following output.

cqlsh:tutorialspoint> select * from emp;

 emp_id | emp_city  | emp_name | emp_phone  | emp_sal
--------+-----------+----------+------------+---------
      1 | Hyderabad |      ram | 9848022338 |   50000
      2 |     Delhi |    robin | 9848022339 |   50000
      3 |   Chennai |   rahman | 9848022330 |   45000

(3 rows)

Here you can observe that the table data has been updated.

Updating Data using Java API

You can update data in a table using the execute() method of the Session class. Follow the steps given below to update data in a table using the Java API.

Step 1: Create a Cluster Object

Create an instance of the Cluster.builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, you have a method called build() in the Cluster.Builder class. Use the following code to create the cluster object.

//Building a cluster
Cluster cluster = builder2.build();

You can also build the cluster object using a single line of code as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object

Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, then you can set it to the existing one by passing the keyspace name in string format to this method, as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace named tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");

Step 3: Execute Query

You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method. Whatever you pass to this method in string format will be executed on cqlsh. In the following example, we are updating the emp table.
You have to store the query in a string variable and pass it to the execute() method as shown below −

String query = "UPDATE emp SET emp_city='Delhi', emp_sal=50000 WHERE emp_id=2;";

Given below is the complete program to update data in a table using the Java API.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Update_Data {

   public static void main(String args[]){

      //Query
      String query = "UPDATE emp SET emp_city='Delhi', emp_sal=50000 WHERE emp_id=2;";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the query
      session.execute(query);

      System.out.println("Data updated");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Update_Data.java
$java Update_Data

Under normal conditions, it should produce the following output −

Data updated
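When the same update runs repeatedly with different values, you can also use a prepared statement so that the CQL is parsed only once and the driver handles value quoting for you. The following is a sketch under assumptions (not part of the original tutorial), written against the same emp table with the DataStax driver used elsewhere in this chapter; note that a varint column such as emp_sal maps to java.math.BigInteger in the driver, and the class name Update_Data_Prepared is hypothetical.

import java.math.BigInteger;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Hypothetical example class; updates a row using a prepared statement.
public class Update_Data_Prepared {
   public static void main(String args[]){
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("tp");

      //Preparing the statement once; the ? markers are bound per execution
      PreparedStatement prepared = session.prepare(
         "UPDATE emp SET emp_city = ?, emp_sal = ? WHERE emp_id = ?");

      //emp_sal is varint, so the value is bound as a BigInteger
      BoundStatement bound = prepared.bind("Delhi", BigInteger.valueOf(50000), 2);
      session.execute(bound);

      System.out.println("Data updated");
      cluster.close();
   }
}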
Cassandra – Alter Table
Altering a Table

You can alter a table using the command ALTER TABLE. Given below is the syntax for altering a table.

Syntax

ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>

Using the ALTER command, you can perform the following operations −

Add a column
Drop a column

Adding a Column

Using the ALTER command, you can add a column to a table. While adding columns, you have to take care that the column name does not conflict with the existing column names and that the table is not defined with the compact storage option. Given below is the syntax to add a column to a table.

ALTER TABLE table name
ADD new column datatype;

Example

Given below is an example to add a column to an existing table. Here we are adding a column called emp_email of text datatype to the table named emp.

cqlsh:tutorialspoint> ALTER TABLE emp
   … ADD emp_email text;

Verification

Use the SELECT statement to verify whether the column has been added or not. Here you can observe the newly added column emp_email.

cqlsh:tutorialspoint> select * from emp;

 emp_id | emp_city | emp_email | emp_name | emp_phone | emp_sal
--------+----------+-----------+----------+-----------+---------

Dropping a Column

Using the ALTER command, you can delete a column from a table. Before dropping a column from a table, check that the table is not defined with the compact storage option. Given below is the syntax to delete a column from a table using the ALTER command.

ALTER table name
DROP column name;

Example

Given below is an example to drop a column from a table. Here we are deleting the column named emp_email.

cqlsh:tutorialspoint> ALTER TABLE emp DROP emp_email;

Verification

Verify whether the column is deleted using the SELECT statement, as shown below.

cqlsh:tutorialspoint> select * from emp;

 emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------

(0 rows)

Since the emp_email column has been deleted, you cannot find it anymore.

Altering a Table using Java API

You can alter a table using the execute() method of the Session class. Follow the steps given below to alter a table using the Java API.

Step 1: Create a Cluster Object

First of all, create an instance of the Cluster.builder class of the com.datastax.driver.core package as shown below.

//Creating Cluster.Builder object
Cluster.Builder builder1 = Cluster.builder();

Add a contact point (IP address of the node) using the addContactPoint() method of the Cluster.Builder object. This method returns Cluster.Builder.

//Adding contact point to the Cluster.Builder object
Cluster.Builder builder2 = builder1.addContactPoint("127.0.0.1");

Using the new builder object, create a cluster object. To do so, you have a method called build() in the Cluster.Builder class. The following code shows how to create a cluster object.

//Building a cluster
Cluster cluster = builder2.build();

You can also build a cluster object using a single line of code as shown below.

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Step 2: Create a Session Object

Create an instance of the Session object using the connect() method of the Cluster class as shown below.

Session session = cluster.connect();

This method creates a new session and initializes it. If you already have a keyspace, you can set it to the existing one by passing the keyspace name in string format to this method, as shown below.

Session session = cluster.connect("Your keyspace name");

Here we are using the keyspace named tp. Therefore, create the session object as shown below.

Session session = cluster.connect("tp");
Step 3: Execute Query

You can execute CQL queries using the execute() method of the Session class. Pass the query either in string format or as a Statement class object to the execute() method. Whatever you pass to this method in string format will be executed on cqlsh. In the following example, we are adding a column to a table named emp. To do so, you have to store the query in a string variable and pass it to the execute() method as shown below.

//Query
String query = "ALTER TABLE emp ADD emp_email text";
session.execute(query);

Given below is the complete program to add a column to an existing table.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Add_Column {

   public static void main(String args[]){

      //Query
      String query = "ALTER TABLE emp ADD emp_email text";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the query
      session.execute(query);

      System.out.println("Column added");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Add_Column.java
$java Add_Column

Under normal conditions, it should produce the following output −

Column added

Deleting a Column

Given below is the complete program to delete a column from an existing table.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class Delete_Column {

   public static void main(String args[]){

      //Query
      String query = "ALTER TABLE emp DROP emp_email;";

      //Creating Cluster object
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

      //Creating Session object
      Session session = cluster.connect("tp");

      //Executing the query
      session.execute(query);

      System.out.println("Column deleted");
   }
}

Save the above program with the class name followed by .java, and browse to the location where it is saved. Compile and execute the program as shown below.

$javac Delete_Column.java
$java Delete_Column

Under normal conditions, it should produce the following output −

Column deleted