Lucene example source code file (data.txt)
The Lucene data.txt source code
# German special characters are replaced:
häufig haufig

# here the stemmer works okay, it maps related words to the same stem:
abschließen abschliess
abschließender abschliess
abschließendes abschliess
abschließenden abschliess

Tisch tisch
Tische tisch
Tischen tisch

Haus hau
Hauses hau
Häuser hau
Häusern hau

# here's a case where overstemming occurs, i.e. a word is
# mapped to the same stem as unrelated words:
hauen hau

# here's a case where understemming occurs, i.e. two related words
# are not mapped to the same stem. This is the case with basically
# all irregular forms:
Drama drama
Dramen dram

# replace "ß" with 'ss':
Ausmaß ausmass

# fake words to test if suffixes are cut off:
xxxxxe xxxxx
xxxxxs xxxxx
xxxxxn xxxxx
xxxxxt xxxxx
xxxxxem xxxxx
xxxxxer xxxxx
xxxxxnd xxxxx

# the suffixes are also removed when combined:
xxxxxetende xxxxx

# words that are shorter than four characters are not changed:
xxe xxe

# -em and -er are not removed from words shorter than five characters:
xxem xxem
xxer xxer

# -nd is not removed from words shorter than six characters:
xxxnd xxxnd
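The following is a minimal sketch of how a data file like this might be consumed. It is not Lucene's own test code, and it makes some assumptions. The file is taken to hold one word/stem pair per line, separated by whitespace or a semicolon, with `#` starting a comment line. The names `parse_pairs`, `toy_strip`, and `failures` are hypothetical helpers introduced for illustration. `toy_strip` is not Lucene's GermanStemmer: it only mimics the suffix and length rules spelled out in the fake-word comments above.

```python
import re

# Fake-word section of the data file, assumed to be one pair per line.
SAMPLE = """\
# fake words to test if suffixes are cut off:
xxxxxe xxxxx
xxxxxs xxxxx
xxxxxn xxxxx
xxxxxt xxxxx
xxxxxem xxxxx
xxxxxer xxxxx
xxxxxnd xxxxx
# the suffixes are also removed when combined:
xxxxxetende xxxxx
# words that are shorter than four characters are not changed:
xxe xxe
# -em and -er are not removed from words shorter than five characters:
xxem xxem
xxer xxer
# -nd is not removed from words shorter than six characters:
xxxnd xxxnd
"""


def parse_pairs(text):
    """Return (word, expected_stem) pairs, skipping comments and blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the separator is whitespace or a semicolon.
        parts = re.split(r"[;\s]+", line)
        if len(parts) != 2:
            raise ValueError(f"expected 'word stem', got: {line!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


def toy_strip(word):
    """Hypothetical stripper covering only the fake-word rules in the comments."""
    w = word.lower()
    # Words shorter than four characters are left alone.
    while len(w) > 3:
        if len(w) > 5 and w.endswith("nd"):
            w = w[:-2]          # -nd needs at least six characters
        elif len(w) > 4 and w.endswith(("em", "er")):
            w = w[:-2]          # -em / -er need at least five characters
        elif w[-1] in "esnt":
            w = w[:-1]          # single-letter suffixes
        else:
            break
    return w


def failures(pairs, stem):
    """List (word, expected, actual) for every pair the stemmer gets wrong."""
    return [(w, exp, stem(w)) for w, exp in pairs if stem(w) != exp]


pairs = parse_pairs(SAMPLE)
print(len(pairs), "pairs,", len(failures(pairs, toy_strip)), "failures")
```

Any callable that maps a word to a stem can be passed to `failures` in place of `toy_strip`. The real-word lines (the umlaut and ß cases, for instance) depend on substitution steps that this sketch deliberately omits, so it should not be expected to pass them.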
Copyright 1998-2019 Alvin Alexander, alvinalexander.com
All Rights Reserved.