Lucene is a powerful and efficient search engine library that provides a vast range of text-searching techniques to match user requirements. What sets Lucene apart from other text-search engines is the flexibility it gives the user: no searching option is restricted or favored over another. It has strong, well-proven searching algorithms and supports a wide range of query types.
Lucene applications follow a basic cycle that comprises two phases: first, indexing the data at a specific path, and then searching over it. The data can be in any textual form, as per the user's needs. During indexing the data is tokenized, and searching is then performed on these tokens using queries. It is thus a two-step process: the data is first processed, and then searched.
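The two phases of this cycle can be sketched end-to-end. The sketch below assumes Apache Lucene's Java API (8.x); class names such as ByteBuffersDirectory belong to that version and may differ in other versions or ports, and the field name "content" is illustrative:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneCycle {
    // Index one document, then search it back; returns the number of hits.
    static long indexAndSearch() throws IOException, ParseException {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();          // in-memory index for the sketch

        // Phase 1: index. The writer tokenizes field values with the analyzer.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "a tutorial for DistributedLucene", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Phase 2: search. The same analyzer tokenizes the query text.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("tutorial");
            TopDocs hits = searcher.search(query, 10);
            return hits.totalHits.value;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("hits: " + indexAndSearch());
    }
}
```

Each class used here is described individually in the sections that follow.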
The following diagram depicts the basic Lucene model workflow.
Components of Lucene
In order to use Lucene for text searching, let us take a closer look at the main API Lucene exposes. Please note that this is not a complete guide to Lucene; however, it will help you understand the basic, fundamental Lucene API.
The following is the basic API used to build a Lucene-based solution.
Directory: The base class that defines where the indexes are stored.
Document: Contains the data that is to be indexed.
Analyzer: While the data is being indexed, the analyzer decides the criteria on which the data is tokenized and, hence, how it can be searched. It is likewise used while querying, since query text is also tokenized.
IndexWriter: Documents created beforehand are written to the directory through this class. Only one IndexWriter can be open on a directory at a time.
IndexReader: Every directory where indexes are created may have one or more readers open on it, responsible for reading data from the indexes.
IndexSearcher: The searcher is responsible for querying the data through the reader(s).
Given below is the detailed description of these classes:
Directory
This class represents the location where Lucene indexes are stored. A directory needs to be opened before indexes can be maintained in it and further operations can be performed on them.
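As a small sketch, assuming Apache Lucene's Java API (8.x), a directory can be opened on disk or held in memory; the path below is hypothetical:

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirExample {
    public static void main(String[] args) throws IOException {
        // On-disk index at a hypothetical path; Lucene creates its index files there.
        Directory diskDir = FSDirectory.open(Paths.get("/tmp/lucene-index"));
        // In-memory index, useful for tests and short-lived data.
        Directory memDir = new ByteBuffersDirectory();
        diskDir.close();
        memDir.close();
    }
}
```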
Document
A document is a collection of fields. Each field holds textual data as a value against a field name; the value is the data that you want to index and make searchable later. For every field, the user can specify whether its value should be analyzed or not. The document is the basic unit of search: it is what gets returned when a query matches one of its fields.
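A minimal sketch of building such a document, assuming Apache Lucene's Java API (8.x); the field names are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class DocExample {
    static Document makeDoc() {
        Document doc = new Document();
        // TextField: the value is analyzed (tokenized) and, here, also stored for retrieval.
        doc.add(new TextField("content", "This is a tutorial for DistributedLucene", Field.Store.YES));
        // StringField: the value is indexed as one single token, not analyzed.
        doc.add(new StringField("id", "doc-1", Field.Store.YES));
        return doc;
    }
}
```

The choice between TextField and StringField is how the user expresses whether a field value should be analyzed or treated as an atomic value.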
Analyzer
The analyzer, as discussed earlier, has the prime responsibility of tokenizing the data into smaller chunks as per the type provided. It essentially parses the fields of the documents into indexable tokens. The way data is analyzed directly influences the user's ability to search it. Listed below are the four basic and most commonly used Lucene analyzers.
Whitespace Analyzer
As the name shows, this analyzer splits the data on whitespace. It also maintains the case of the words, so tokens are indexed exactly as they originally appear.
Standard Analyzer
Like the simple analyzer, it tokenizes the data in lower-case, and it additionally removes stop words. It also recognizes URLs and email addresses and generates tokens accordingly. Due to this functionality it is the most commonly used analyzer and is considered the most intelligent one.
Simple Analyzer
It indexes the data in lower-case and splits it on non-letters. It does not index URLs or non-letter characters such as symbols or numbers.
Stop Analyzer
The stop analyzer generates tokens by splitting the data on non-letter characters and by removing stop words. Stop words are words that are ignored while indexing because the analyzer considers them irrelevant to search. Moreover, like the simple analyzer, the stop analyzer does not index URLs or non-letter characters such as symbols or numbers.
Consider the sentence below:
This is a tutorial for “DistributedLucene” at alachisoft.com.
| Whitespace Analyzer | Stop Analyzer | Simple Analyzer | Standard Analyzer |
|---|---|---|---|
| This | tutorial | this | tutorial |
| is | distributedlucene | is | distributedlucene |
| a | alachisoft | a | alachisoft.com |
| tutorial | com | tutorial | |
| for | | for | |
| “DistributedLucene” | | distributedlucene | |
| at | | at | |
| alachisoft.com. | | alachisoft | |
| | | com | |
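The whitespace, simple, and stop behaviors can be imitated in plain Java to make the tokenization rules concrete. This is only a sketch of the rules described above, not the real analyzers: the stop-word set below is a small illustrative subset, and the standard analyzer's URL/email recognition is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AnalyzerSketch {
    // A small illustrative subset of English stop words.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "at", "for", "is", "the", "this");

    // Whitespace analyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    // Simple analyzer: lower-case, split on non-letters, drop empty tokens.
    static List<String> simple(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }

    // Stop analyzer: like the simple analyzer, with stop words removed.
    static List<String> stop(String text) {
        return simple(text).stream()
                           .filter(t -> !STOP_WORDS.contains(t))
                           .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String s = "This is a tutorial for \"DistributedLucene\" at alachisoft.com.";
        System.out.println("Whitespace: " + whitespace(s));
        System.out.println("Simple:     " + simple(s));
        System.out.println("Stop:       " + stop(s));
    }
}
```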
IndexWriter
In order to add documents and index them, we need an IndexWriter. The writer needs:
A valid Lucene Directory: where the documents are to be indexed.
An Analyzer: so that the writer analyzes the data according to it.
The writer runs the documents through the analyzer and then adds the results to the directory for storage. Precisely, the writer is responsible for performing all write operations on the data correctly and then storing it. A set of operations can be performed through the writer, and the writer is disposed once all operations are complete.
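A sketch of this write path, assuming Apache Lucene's Java API (8.x), with an illustrative "content" field:

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class WriterExample {
    static Directory buildIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // try-with-resources disposes the writer once all write operations
        // are done, committing the changes and releasing the index lock.
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            doc.add(new TextField("content", "a tutorial for DistributedLucene", Field.Store.YES));
            writer.addDocument(doc);   // analyzed, then written to the directory
        }
        return dir;
    }
}
```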
IndexReader
The main functionality of Lucene is to search the data added previously. For this purpose we have the IndexReader, which reads the data from the directory. On every update to the documents, the IndexReader needs to be re-initialized. You can either create an IndexReader yourself or get an instance of one from the IndexWriter. You pass the reader the directory instance that has all the documents indexed.
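Both ways of obtaining a reader, and the re-initialization after an update, can be sketched against Apache Lucene's Java API (8.x):

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class ReaderExample {
    // Open a reader directly on a directory that already holds an index.
    static DirectoryReader openFromDirectory(Directory dir) throws IOException {
        return DirectoryReader.open(dir);
    }

    // Or obtain a near-real-time reader from a live writer.
    static DirectoryReader openFromWriter(IndexWriter writer) throws IOException {
        return DirectoryReader.open(writer);
    }

    // After an update, re-initialize: returns a fresh reader if the index
    // changed, or null if the current reader is already up to date.
    static DirectoryReader refresh(DirectoryReader current) throws IOException {
        return DirectoryReader.openIfChanged(current);
    }
}
```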
IndexSearcher
The IndexSearcher is initialized with an instance of the IndexReader; it takes the query from the user and searches the data accordingly. Query is an abstract class provided by Lucene, used to define the scope and type of searching to be performed. Searching returns a TopDocs instance, which is the result of running the query.
Once the query text is ready, you parse it using the QueryParser. The analyzer is passed to the QueryParser, which parses the query according to that analyzer. Moreover, the field name added in the document is passed to the parser so that the data is searched in that field.
TopDocs maintains an array of ScoreDocs, which holds the search results of the query execution. By iterating over this array, the resultant data can be retrieved.
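Putting the searcher, query parser, and ScoreDocs iteration together, assuming Apache Lucene's Java API (8.x) and an index whose documents carry a stored "content" field (an illustrative assumption for this sketch):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class SearchExample {
    // Parse the query text with the same analyzer used at index time,
    // search, and collect the stored "content" field of each hit.
    static List<String> search(Directory dir, String text)
            throws IOException, ParseException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", new StandardAnalyzer()).parse(text);
            TopDocs topDocs = searcher.search(query, 10);   // top 10 hits
            List<String> results = new ArrayList<>();
            for (ScoreDoc sd : topDocs.scoreDocs) {         // iterate the ScoreDoc array
                Document hit = searcher.doc(sd.doc);        // fetch the stored document
                results.add(hit.get("content"));
            }
            return results;
        }
    }
}
```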
Recommendation: Search with the same analyzer with which the data was indexed in the first place.