alvinalexander.com | career | drupal | java | mac | mysql | perl | scala | uml | unix  

Lucene example source code file (scoring.xml)

This example Lucene source code file (scoring.xml) is included in the DevDaily.com "Java Source Code Warehouse" project. The intent of this project is to help you "Learn Java by Example" TM.

Java - Lucene tags/keywords

changing, expert, lucene, lucene, query, query, scorer, scorer, scoring, scoring, searcher, similarity, the, weight

The Lucene scoring.xml source code

<?xml version="1.0"?>

<document>
	<header>
        <title>
	Apache Lucene - Scoring
		</title>
	</header>
    <properties>
        <author email="gsingers at apache.org">Grant Ingersoll
    </properties>

    <body>

        <section id="Introduction">Introduction
            <p>Lucene scoring is the heart of why we all love Lucene.  It is blazingly fast and it hides almost all of the complexity from the user.
                In a nutshell, it works.  At least, that is, until it doesn't work, or doesn't work as one would expect it to
            work.  Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
            scores lower than a different document with only one of the query terms. </p>
            <p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
            help you figure out the what and why of Lucene scoring.</p>
            <p>Lucene scoring uses a combination of the
                <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
                    Retrieval</a> and the Boolean model
                to determine
                how relevant a given Document is to a User's query.  In general, the idea behind the VSM is the more
                times a query term appears in a document relative to
                the number of times the term appears in all the documents in the collection, the more relevant that
                document is to the query.  It uses the Boolean model to first narrow down the documents that need to
                be scored based on the use of boolean logic in the Query specification.  Lucene also adds some
                capabilities and refinements onto this model to support boolean and fuzzy searching, but it
                essentially remains a VSM based system at the heart.
                For some valuable references on VSM and IR in general refer to the
                <a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references.
            </p>
            <p>The rest of this document will cover Scoring basics and how to change your
                <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity.  Next it will cover ways you can
                customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
                -- Expert Level</a> which gives details on implementing your own
                <a href="api/core/org/apache/lucene/search/Query.html">Query class and related functionality.  Finally, we
                will finish up with some reference material in the <a href="#Appendix">Appendix.
            </p>
        </section>
        <section id="Scoring">Scoring
            <p>Scoring is very much dependent on the way documents are indexed,
                so it is important to understand indexing (see
                <a href="gettingstarted.html">Apache Lucene - Getting Started Guide
                and the Lucene
                <a href="fileformats.html">file formats
                before continuing on with this section.)  It is also assumed that readers know how to use the
                <a href="api/core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc) functionality,
                which can go a long way in informing why a score is returned.
            </p>
            <section id="Fields and Documents">Fields and Documents
                <p>In Lucene, the objects we are scoring are
                    <a href="api/core/org/apache/lucene/document/Document.html">Documents.  A Document is a collection
                of
                    <a href="api/core/org/apache/lucene/document/Field.html">Fields.  Each Field has semantics about how
                it is created and stored (i.e. tokenized, untokenized, raw data, compressed, etc.)  It is important to
                    note that Lucene scoring works on Fields and then combines the results to return Documents.  This is
                    important because two Documents with the exact same content, but one having the content in two Fields
                    and the other in one Field will return different scores for the same query due to length normalization
                    (assumming the
                    <a href="api/core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity
                    on the Fields).
                </p>
            </section>
            <section id="Score Boosting">Score Boosting
                <p>Lucene allows influencing search results by "boosting" in more than one level:
                  <ul>
                    <li>Document level boosting
                    - while indexing - by calling
                    <a href="api/core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()
                    before a document is added to the index.
                    </li>
                    <li>Document's Field level boosting
                    - while indexing - by calling
                    <a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()
                    before adding a field to the document (and before adding the document to the index).
                    </li>
                    <li>Query level boosting
                     - during search, by setting a boost on a query clause, calling
                     <a href="api/core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost().
                    </li>
                  </ul>
                </p>
                <p>Indexing time boosts are preprocessed for storage efficiency and written to
                  the directory (when writing the document) in a single byte (!) as follows:
                  For each field of a document, all boosts of that field
                  (i.e. all boosts under the same field name in that doc) are multiplied.
                  The result is multiplied by the boost of the document,
                  and also multiplied by a "field length norm" value
                  that represents the length of that field in that doc
                  (so shorter fields are automatically boosted up).
                  The result is decoded as a single byte
                  (with some precision loss of course) and stored in the directory.
                  The similarity object in effect at indexing computes the length-norm of the field.
                </p>
                <p>This composition of 1-byte representation of norms
                (that is, indexing time multiplication of field boosts & doc boost & field-length-norm)
                is nicely described in
                <a href="api/core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost().
                </p>
                <p>Encoding and decoding of the resulted float norm in a single byte are done by the
                static methods of the class Similarity:
                <a href="api/core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm() and
                <a href="api/core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm().
                Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
                e.g. decode(encode(0.89)) = 0.75.
                At scoring (search) time, this norm is brought into the score of document
                as <b>norm(t, d), as shown by the formula in
                <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity.
                </p>
            </section>
            <section id="Understanding the Scoring Formula">Understanding the Scoring Formula

                <p>
                This scoring formula is described in the
                    <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity class.  Please take the time to study this formula, as it contains much of the information about how the
                    basics of Lucene scoring work, especially the
                    <a href="api/core/org/apache/lucene/search/TermQuery.html">TermQuery.
                </p>
            </section>
            <section id="The Big Picture">The Big Picture
                <p>OK, so the tf-idf formula and the
                    <a href="api/core/org/apache/lucene/search/Similarity.html">Similarity
                    is great for understanding the basics of Lucene scoring, but what really drives Lucene scoring are
                    the use and interactions between the
                    <a href="api/core/org/apache/lucene/search/Query.html">Query classes, as created by each application in
                    response to a user's information need.
                </p>
                <p>In this regard, Lucene offers a wide variety of Query implementations, most of which are in the
                    <a href="api/core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search package.
                    These implementations can be combined in a wide variety of ways to provide complex querying
                    capabilities along with
                    information about where matches took place in the document collection. The <a href="#Query Classes">Query
                    section below
                    highlights some of the more important Query classes.  For information on the other ones, see the
                    <a href="api/core/org/apache/lucene/search/package-summary.html">package summary.  For details on implementing
                    your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
                    Expert Level</a> below.
                </p>
                <p>Once a Query has been created and submitted to the
                    <a href="api/core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher, the scoring process
                begins.  (See the <a
                href="#Appendix">Appendix</a> Algorithm section for more notes on the process.)  After some infrastructure setup,
                control finally passes to the <a href="api/core/org/apache/lucene/search/Weight.html">Weight implementation and its
                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer instance.  In the case of any type of
                    <a href="api/core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery, scoring is handled by the
                    <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2
                    (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
                    <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight
                    (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
                </p>
                <p>
                    Assuming the use of the BooleanWeight2, a
                    BooleanScorer2 is created by bringing together
                    all of the
                    <a href="api/core/org/apache/lucene/search/Scorer.html">Scorers from the sub-clauses of the BooleanQuery.
                    When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
                    of clauses in the Query.  This internal Scorer essentially loops over the sub scorers and sums the scores
                    provided by each scorer while factoring in the coord() score.
                    <!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
                </p>
            </section>
            <section id="Query Classes">Query Classes
                <p>For information on the Query Classes, refer to the
                    <a href="api/core/org/apache/lucene/search/package-summary.html#query">search package javadocs
                </p>
            </section>
            <section id="Changing Similarity">Changing Similarity
                <p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors.  For information on
                how to do this, see the
                    <a href="api/core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs

</section> </section> <section id="Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level <p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes.) To learn more about how to do this, refer to the <a href="api/core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs </p> </section> <section id="Appendix">Appendix <section id="Algorithm">Algorithm <p>This section is mostly notes on stepping through the Scoring process and serves as fertilizer for the earlier sections.</p> <p>In the typical search application, a <a href="api/core/org/apache/lucene/search/Query.html">Query is passed to the <a href="api/core/org/apache/lucene/search/Searcher.html">Searcher</a> , beginning the scoring process. </p> <p>Once inside the Searcher, a <a href="api/core/org/apache/lucene/search/Collector.html">Collector is used for the scoring and sorting of the search results. These important objects are involved in a search: <ol> <li>The <a href="api/core/org/apache/lucene/search/Weight.html">Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the Searcher. </li> <li>The Searcher that initiated the call. <li>A <a href="api/core/org/apache/lucene/search/Filter.html">Filter for limiting the result set. Note, the Filter may be null. </li> <li>A <a href="api/core/org/apache/lucene/search/Sort.html">Sort object for specifying how to sort the results if the standard score based sort method is not desired. </li> </ol> </p> <p> Assuming we are not sorting (since sorting doesn't effect the raw Lucene score), we call one of the search methods of the Searcher, passing in the <a href="api/core/org/apache/lucene/search/Weight.html">Weight object created by Searcher.createWeight(Query), <a href="api/core/org/apache/lucene/search/Filter.html">Filter and the number of results we want. This method returns a <a href="api/core/org/apache/lucene/search/TopDocs.html">TopDocs object, which is an internal collection of search results. The Searcher creates a <a href="api/core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector and passes it along with the Weight, Filter to another expert search method (for more on the <a href="api/core/org/apache/lucene/search/Collector.html">Collector mechanism, see <a href="api/core/org/apache/lucene/search/Searcher.html">Searcher .) The TopDocCollector uses a <a href="api/core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue to collect the top results for the search. </p> <p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer for the <a href="api/core/org/apache/lucene/index/IndexReader.html">IndexReader of the current searcher and we proceed by calling the score method on the <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer . </p> <p>At last, we are actually going to score some documents. The score method takes in the Collector (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here is where things get involved. The <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer that is returned by the <a href="api/core/org/apache/lucene/search/Weight.html">Weight object depends on what type of Query was submitted. In most real world applications with multiple query terms, the <a href="api/core/org/apache/lucene/search/Scorer.html">Scorer is going to be a <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2 (see the section on customizing your scoring for info on changing this.) </p> <p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the coord() factor. We then get a internal Scorer based on the required, optional and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer#next() method. The next() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overriden by all derived implementations. <!-- DOUBLE CHECK THIS -->If you have a simple OR query your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub scorers of the OR'd terms.</p> </section> </section> </body> </document>

Other Lucene examples (source code examples)

Here is a short list of links related to this Lucene scoring.xml source code file:

... this post is sponsored by my books ...

#1 New Release!

FP Best Seller

 

new blog posts

 

Copyright 1998-2021 Alvin Alexander, alvinalexander.com
All Rights Reserved.

A percentage of advertising revenue from
pages under the /java/jwarehouse URI on this website is
paid back to open source projects.