
Lucene example source code file (demo2.xml)

This example Lucene source code file (demo2.xml) is included in the DevDaily.com "Java Source Code Warehouse" project. The intent of this project is to help you "Learn Java by Example" TM.

The Lucene demo2.xml source code

<?xml version="1.0"?>
	Apache Lucene - Basic Demo Sources Walk-through
<author email="acoliver@apache.org">Andrew C. Oliver

<section id="About the Code">About the Code
In this section we walk through the sources behind the command-line Lucene demo: where to find them,
their parts and their function.  This section is intended for Java developers wishing to understand
how to use Lucene in their applications.

<section id="Location of the source">Location of the source

NOTE: to examine the sources, you need to download and extract a source distribution of
Lucene (lucene-{version}-src.zip).

Relative to the directory created when you extracted Lucene, you
should see a directory called <code>lucene/contrib/demo/</code>.  This is the root for the Lucene
demo.  Under this directory is <code>src/java/org/apache/lucene/demo/</code>.  This is where all
the Java sources for the demo live.

Within this directory you should see the <code>IndexFiles.java</code> class we executed earlier.
Bring it up in <code>vi</code> or your editor of choice and let's take a look at it.


<section id="IndexFiles">IndexFiles

As we discussed in the previous walk-through, the <a
href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class creates a Lucene
Index. Let's take a look at how it does this.

The <code>main()</code> method parses the command-line parameters, then in preparation for
instantiating <a href="api/core/org/apache/lucene/index/IndexWriter.html">IndexWriter</a>, opens a 
<a href="api/core/org/apache/lucene/store/Directory.html">Directory</a> and instantiates
<a href="api/module-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> and
<a href="api/core/org/apache/lucene/index/IndexWriterConfig.html">IndexWriterConfig</a>.

The value of the <code>-index</code> command-line parameter is the name of the filesystem directory
where all index information should be stored.  If <code>IndexFiles</code> is invoked with a 
relative path in the <code>-index</code> command-line parameter, or if <code>-index</code>
is omitted so that the default relative index path "<code>index</code>" is used,
the index path will be created as a subdirectory of the current working directory
(if it does not already exist).  On some platforms, the index path may be created in a different
directory (such as the user's home directory).

The <code>-docs</code> command-line parameter value is the location of the directory containing
files to be indexed.

The <code>-update</code> command-line parameter tells <code>IndexFiles</code> not to delete the
index if it already exists.  When <code>-update</code> is not given, <code>IndexFiles</code> will
first wipe the slate clean before indexing any documents.
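
The three parameters above can be sketched in plain Java. This is a minimal illustration, not the demo's actual parsing code: the class name <code>IndexFilesArgsSketch</code> and its field names are hypothetical, and treating <code>-docs</code> as required and rejecting unknown options are assumptions made for the sketch. Only the default index path "index" and the meaning of <code>-update</code> come from the text.

```java
final class IndexFilesArgsSketch {
    String indexPath = "index"; // default relative index path named in the text
    String docsPath = null;     // directory of files to index (assumed required)
    boolean create = true;      // true: wipe any existing index before indexing

    /** Parses -index, -docs and -update; hypothetical helper for illustration. */
    static IndexFilesArgsSketch parse(String[] args) {
        IndexFilesArgsSketch opts = new IndexFilesArgsSketch();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-index":
                    opts.indexPath = requireValue(args, ++i, "-index");
                    break;
                case "-docs":
                    opts.docsPath = requireValue(args, ++i, "-docs");
                    break;
                case "-update":
                    // Keep the existing index instead of starting from scratch.
                    opts.create = false;
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        if (opts.docsPath == null) {
            throw new IllegalArgumentException("Missing required option: -docs");
        }
        return opts;
    }

    private static String requireValue(String[] args, int i, String flag) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value for " + flag);
        }
        return args[i];
    }
}
```

Calling <code>IndexFilesArgsSketch.parse(new String[] {"-docs", "src"})</code> leaves <code>indexPath</code> at "index" and <code>create</code> at <code>true</code>; adding <code>-update</code> flips <code>create</code> to <code>false</code>.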

Lucene <a href="api/core/org/apache/lucene/store/Directory.html">Directory</a> instances are used by the
<code>IndexWriter</code> to store information in the index.  In addition to the 
<a href="api/core/org/apache/lucene/store/FSDirectory.html">FSDirectory</a> implementation we are using,
there are several other <code>Directory</code> subclasses that can write to RAM, to databases, etc.

Lucene <a href="api/core/org/apache/lucene/analysis/Analyzer.html">Analyzer</a>s are processing pipelines
that break up text into indexed tokens, a.k.a. terms, and optionally perform other operations on these
tokens, e.g. downcasing, synonym insertion, filtering out unwanted tokens, etc.  The <code>Analyzer</code>
we are using is <code>StandardAnalyzer</code>, which creates tokens using the Word Break rules from the
Unicode Text Segmentation algorithm specified in <a href="http://unicode.org/reports/tr29/">Unicode
Standard Annex #29</a>; converts tokens to lowercase; and then filters out stopwords.  Stopwords are
common language words such as articles (a, an, the, etc.) and other tokens that may have less value for
searching.  Note that the rules differ from language to language, and you should use the
proper analyzer for each.  Lucene currently provides Analyzers for a number of different languages (see
the javadocs under 
<a href="api/contrib-analyzers/org/apache/lucene/analysis/package-summary.html"
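
To make the tokenize, lowercase, and stopword-filter stages concrete, here is a minimal plain-Java sketch of such a pipeline. It is only a simplified stand-in and does not use or reproduce <code>StandardAnalyzer</code>: the class name <code>SimpleAnalysisSketch</code> is hypothetical, the three-word stopword set is illustrative rather than Lucene's list, and the JDK's <code>java.text.BreakIterator</code> is swapped in for word-boundary detection on the assumption that it is close enough for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SimpleAnalysisSketch {
    // Tiny illustrative stopword set; not the list Lucene ships with.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    /** Splits text at word boundaries, lowercases tokens, drops stopwords. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = text.substring(start, end);
            // Skip segments that are whitespace or punctuation.
            if (!Character.isLetterOrDigit(token.codePointAt(0))) {
                continue;
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Running the same <code>analyze()</code> over both document text and query text is what lets a query for "Fox" match a document containing "fox".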

The <code>IndexWriterConfig</code> instance holds all configuration for <code>IndexWriter</code>.  For
example, we set the <code>OpenMode</code> to use here based on the value of the <code>-update</code>
command-line parameter.

Looking further down in the file, after <code>IndexWriter</code> is instantiated, you should see the
<code>indexDocs()</code> code.  This recursive function crawls the directories and creates
<a href="api/core/org/apache/lucene/document/Document.html">Document</a> objects.  The 
<code>Document</code> is simply a data object to represent the text content from the file as well as
its last-modified time and location.  These instances are added to the <code>IndexWriter</code>.  If
the <code>-update</code> command-line parameter is given, the <code>IndexWriter</code> 
<code>OpenMode</code> will be set to <code>OpenMode.CREATE_OR_APPEND</code>, and rather than
adding documents to the index, the <code>IndexWriter</code> will update them
in the index by attempting to find an already-indexed document with the same identifier (in our
case, the file path serves as the identifier); deleting it from the index if it exists; and then
adding the new document to the index.
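
The difference between adding and updating can be sketched without Lucene. The class below is a hypothetical in-memory stand-in, not the <code>IndexWriter</code> API: <code>AddVsUpdateSketch</code>, <code>Doc</code>, and the method names are illustrative assumptions. It crawls a directory recursively and either appends each file or replaces any entry with the same path. It records the file's last-modified time as the timestamp and reads contents as UTF-8, both of which are assumptions of the sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class AddVsUpdateSketch {
    /** Hypothetical stand-in for a document: location, timestamp, text. */
    static final class Doc {
        final String path;
        final long modified;
        final String contents;

        Doc(String path, long modified, String contents) {
            this.path = path;
            this.modified = modified;
            this.contents = contents;
        }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Plain add: appends even if the same path is already present. */
    void addDocument(Doc doc) {
        docs.add(doc);
    }

    /** Update: delete any entry with the same identifier, then add. */
    void updateDocument(String path, Doc doc) {
        docs.removeIf(d -> d.path.equals(path));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    /** Recursively crawls a directory; create=false mimics -update. */
    void indexDocs(Path path, boolean create) throws IOException {
        if (Files.isDirectory(path)) {
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    indexDocs(child, create);
                }
            }
        } else if (Files.isReadable(path)) {
            Doc doc = new Doc(path.toString(),
                    Files.getLastModifiedTime(path).toMillis(),
                    new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
            if (create) {
                addDocument(doc);
            } else {
                updateDocument(doc.path, doc);
            }
        }
    }
}
```

Adding the same path twice leaves two entries, whereas updating it leaves exactly one. This is why a plain add is only safe when the index starts out empty.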


<section id="Searching Files">Searching Files

The <a href="api/contrib-demo/org/apache/lucene/demo/SearchFiles.html">SearchFiles</a> class is
quite simple.  It primarily collaborates with an 
<a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, 
<a href="api/modules-analysis-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html"
>StandardAnalyzer</a> (which is used in the
<a href="api/contrib-demo/org/apache/lucene/demo/IndexFiles.html">IndexFiles</a> class as well)
and a <a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a>.  The
query parser is constructed with an analyzer used to interpret your query text in the same way the
documents are interpreted: finding word boundaries, downcasing, and removing stopwords like
'a', 'an' and 'the'.  The <a href="api/core/org/apache/lucene/search/Query.html">Query</a>
object holds the parsed query produced by the
<a href="api/core/org/apache/lucene/queryParser/QueryParser.html">QueryParser</a> and is passed
to the searcher.  Note that it's also possible to programmatically construct a rich 
<a href="api/core/org/apache/lucene/search/Query.html">Query</a> object without using the query
parser.  The query parser just enables decoding the <a href="queryparsersyntax.html">Lucene query
syntax</a> into the corresponding <code>Query</code> object.

<code>SearchFiles</code> uses the <code>IndexSearcher.search(query,n)</code> method that returns
<a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs</a> with at most <code>n</code> hits.
The results are printed in pages, sorted by score (i.e. relevance).
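
The "top <code>n</code> hits, shown in pages" idea can be sketched in plain Java. This does not use <code>IndexSearcher</code> or <code>TopDocs</code>: the <code>PagedHitsSketch</code> class, its <code>Hit</code> type, and the method names are hypothetical, and the scores are assumed to have been computed already.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PagedHitsSketch {
    /** Hypothetical scored hit: a document path and its relevance score. */
    static final class Hit {
        final String path;
        final float score;

        Hit(String path, float score) {
            this.path = path;
            this.score = score;
        }
    }

    /** Returns at most n hits, highest score first. */
    static List<Hit> topN(List<Hit> all, int n) {
        List<Hit> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Returns one page of hits; pageIndex is zero-based and non-negative. */
    static List<Hit> page(List<Hit> hits, int pageIndex, int hitsPerPage) {
        int start = Math.min(pageIndex * hitsPerPage, hits.size());
        int end = Math.min(start + hitsPerPage, hits.size());
        return new ArrayList<>(hits.subList(start, end));
    }
}
```

Asking for a page past the end returns an empty list, which a caller can treat as the signal to stop paging.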

Copyright 1998-2019 Alvin Alexander, alvinalexander.com
All Rights Reserved.
