<td>segments.gen, segments_N <td>Stores information about segments </tr> <tr> <td>Lock File <td>write.lock <td>The Write lock prevents multiple IndexWriters from writing to the same file. </tr> <tr> <td>Compound File <td>.cfs <td>An optional "virtual" file consisting of all the other index files for systems that frequently run out of file handles.</td> </tr> <tr> <td>Fields <td>.fnm <td>Stores information about the fields </tr> <tr> <td>Field Index <td>.fdx <td>Contains pointers to field data </tr> <tr> <td>Field Data <td>.fdt <td>The stored fields for documents </tr> <tr> <td>Term Infos <td>.tis <td>Part of the term dictionary, stores term info </tr> <tr> <td>Term Info Index <td>.tii <td>The index into the Term Infos file </tr> <tr> <td>Frequencies <td>.frq <td>Contains the list of docs which contain each term along with frequency </tr> <tr> <td>Positions <td>.prx <td>Stores position information about where a term occurs in the index </tr> <tr> <td>Norms <td>.nrm <td>Encodes length and boost factors for docs and fields </tr> <tr> <td>Term Vector Index <td>.tvx <td>Stores offset into the document data file </tr> <tr> <td>Term Vector Documents <td>.tvd <td>Contains information about each document that has term vectors </tr> <tr> <td>Term Vector Fields <td>.tvf <td>The field level info about term vectors </tr> <tr> <td>Deleted Documents <td>.del <td>Info about which documents are deleted </tr> </table> </section> <section id="Primitive Types">Primitive Types <section id="Byte">Byte The most primitive type is an eight-bit byte. Files are accessed as sequences of bytes. All other data types are defined as sequences of bytes, so file formats are byte-order independent. </section> <section id="UInt32">UInt32 32-bit unsigned integers are written as four bytes, high-order bytes first. UInt32 --> <Byte>⁴ </section> <section id="Uint64">Uint64 64-bit unsigned integers are written as eight bytes, high-order bytes first. 
UInt64 --> <Byte>⁸ </section> <section id="VInt">VInt A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on. VInt Encoding Example <table width="100%" border="0" cellpadding="4" cellspacing="0"> <col width="64*"/> <col width="64*"/> <col width="64*"/> <col width="64*"/> <tr valign="TOP"> <td width="25%"> Value </td> <td width="25%"> First byte </td> <td width="25%"> Second byte </td> <td width="25%"> Third byte </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="0" sdnum="1033;0;#,##0"> 0 </td> <td width="25%" sdval="0" sdnum="1033;0;00000000"> 00000000 </td> <td width="25%" sdnum="1033;0;00000000"> </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="1" sdnum="1033;0;#,##0"> 1 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> <td width="25%" sdnum="1033;0;00000000"> </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="2" sdnum="1033;0;#,##0"> 2 </td> <td width="25%" sdval="10" sdnum="1033;0;00000000"> 00000010 </td> <td width="25%" sdnum="1033;0;00000000"> </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr> <td width="25%" valign="TOP"> ... 
</td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="127" sdnum="1033;0;#,##0"> 127 </td> <td width="25%" sdval="1111111" sdnum="1033;0;00000000"> 01111111 </td> <td width="25%" sdnum="1033;0;00000000"> </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="128" sdnum="1033;0;#,##0"> 128 </td> <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> 10000000 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="129" sdnum="1033;0;#,##0"> 129 </td> <td width="25%" sdval="10000001" sdnum="1033;0;00000000"> 10000001 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="130" sdnum="1033;0;#,##0"> 130 </td> <td width="25%" sdval="10000010" sdnum="1033;0;00000000"> 10000010 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr> <td width="25%" valign="TOP"> ... 
</td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="16383" sdnum="1033;0;#,##0"> 16,383 </td> <td width="25%" sdval="11111111" sdnum="1033;0;00000000"> 11111111 </td> <td width="25%" sdval="1111111" sdnum="1033;0;00000000"> 01111111 </td> <td width="25%" sdnum="1033;0;00000000"> </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="16384" sdnum="1033;0;#,##0"> 16,384 </td> <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> 10000000 </td> <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> 10000000 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> </tr> <tr valign="BOTTOM"> <td width="25%" sdval="16385" sdnum="1033;0;#,##0"> 16,385 </td> <td width="25%" sdval="10000001" sdnum="1033;0;00000000"> 10000001 </td> <td width="25%" sdval="10000000" sdnum="1033;0;00000000"> 10000000 </td> <td width="25%" sdval="1" sdnum="1033;0;00000000"> 00000001 </td> </tr> <tr> <td width="25%" valign="TOP"> ... </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> <td width="25%" valign="BOTTOM" sdnum="1033;0;00000000"> </td> </tr> </table> This provides compression while still being efficient to decode. </section> <section id="Chars">Chars Lucene writes unicode character sequences as UTF-8 encoded bytes. </section> <section id="String">String Lucene writes strings as UTF-8 encoded bytes. First the length, in bytes, is written as a VInt, followed by the bytes. String --> VInt, Chars </section> </section> <section id="Compound Types">Compound Types <section id="MapStringString">Map<String,String> In a couple places Lucene stores a Map String->String. 
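As a rough illustration of the VInt and String encodings defined in the Primitive Types section above, the following Python sketch writes and reads both. The function names (write_vint, read_vint, write_string, read_string) are made up for this example and are not Lucene APIs; the sketch follows only the rules stated in this document.

```python
from typing import Tuple


def write_vint(value: int) -> bytes:
    """Encode a non-negative integer as a VInt (low-order 7 bits first)."""
    if value < 0:
        raise ValueError("VInt is defined here for non-negative integers only")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)  # last byte has the high bit clear
    return bytes(out)


def read_vint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a VInt starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift  # each byte adds 7 more significant bits
        if b & 0x80 == 0:
            return value, pos
        shift += 7


def write_string(text: str) -> bytes:
    """Encode a String: byte length as a VInt, then the UTF-8 bytes."""
    raw = text.encode("utf-8")
    return write_vint(len(raw)) + raw


def read_string(data: bytes, pos: int = 0) -> Tuple[str, int]:
    """Decode a String starting at pos; return (text, position after it)."""
    length, pos = read_vint(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length
```

The byte patterns produced by write_vint for 127, 128, 16,383 and 16,384 correspond to the rows of the VInt encoding table.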
Map<String,String> --> Count, <String,String>^Count </section> </section> <section id="Per-Index Files">Per-Index Files The files in this section exist one-per-index. <section id="Segments File">Segments File The active segments in the index are stored in the segment info file, <tt>segments_N</tt>. There may be one or more <tt>segments_N</tt> files in the index; however, the one with the largest generation is the active one (when older segments_N files are present it's because they temporarily cannot be deleted, or a writer is in the process of committing, or a custom <a href="api/core/org/apache/lucene/index/IndexDeletionPolicy.html">IndexDeletionPolicy</a> is in use). This file lists each segment by name, has details about the separate norms and deletion files, and also contains the size of each segment. As of 2.1, there is also a file <tt>segments.gen</tt>. This file contains the current generation (the <tt>_N</tt> in <tt>segments_N</tt>) of the index. This is used only as a fallback in case the current generation cannot be accurately determined by directory listing alone (as is the case for some NFS clients with time-based directory cache expiration). This file simply contains an Int32 version header (SegmentInfos.FORMAT_LOCKLESS = -2), followed by the generation recorded as Int64, written twice. 3.1 Segments --> Format, Version, NameCounter, SegCount, <SegVersion, SegName, SegSize, DelGen, DocStoreOffset, [DocStoreSegment, DocStoreIsCompoundFile], HasSingleNormFile, NumField, NormGen^NumField, IsCompoundFile, DeletionCount, HasProx, Diagnostics, HasVectors>^SegCount, CommitUserData, Checksum Format, NameCounter, SegCount, SegSize, NumField, DocStoreOffset, DeletionCount --> Int32 Version, DelGen, NormGen, Checksum --> Int64 SegVersion, SegName, DocStoreSegment --> String Diagnostics --> Map<String,String> IsCompoundFile, HasSingleNormFile, DocStoreIsCompoundFile, HasProx, HasVectors --> Int8 CommitUserData --> Map<String,String> Format is -9 (SegmentInfos.FORMAT_DIAGNOSTICS). 
Version counts how often the index has been changed by adding or deleting documents. NameCounter is used to generate names for new segment files. SegVersion is the code version that created the segment. SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index. SegSize is the number of documents contained in the segment index. DelGen is the generation count of the separate deletes file. If this is -1, there are no separate deletes. If it is 0, this is a pre-2.1 segment and you must check filesystem for the existence of _X.del. Anything above zero means there are separate deletes (_X_N.del). NumField is the size of the array for NormGen, or -1 if there are no NormGens stored. NormGen records the generation of the separate norms files. If NumField is -1, there are no normGens stored and they are all assumed to be 0 when the segment file was written pre-2.1 and all assumed to be -1 when the segments file is 2.1 or above. The generation then has the same meaning as delGen (above). IsCompoundFile records whether the segment is written as a compound file or not. If this is -1, the segment is not a compound file. If it is 1, the segment is a compound file. Else it is 0, which means we check filesystem to see if _X.cfs exists. If HasSingleNormFile is 1, then the field norms are written as a single joined file (with extension <tt>.nrm); if it is 0 then each field's norms are stored as separate <tt>.fN files. See "Normalization Factors" below for details. DocStoreOffset, DocStoreSegment, DocStoreIsCompoundFile: If DocStoreOffset is -1, this segment has its own doc store (stored fields values and term vectors) files and DocStoreSegment and DocStoreIsCompoundFile are not stored. In this case all files for stored field values (<tt>*.fdt and *.fdx) and term vectors (<tt>*.tvf, *.tvd and <tt>*.tvx) will be stored with this segment. 
Otherwise, DocStoreSegment is the name of the segment that has the shared doc store files; DocStoreIsCompoundFile is 1 if that segment is stored in compound file format (as a <tt>.cfx file); and DocStoreOffset is the starting document in the shared doc store files where this segment's documents begin. In this case, this segment does not store its own doc store files but instead shares a single set of these files with other segments. Checksum contains the CRC32 checksum of all bytes in the segments_N file up until the checksum. This is used to verify integrity of the file on opening the index. DeletionCount records the number of deleted documents in this segment. HasProx is 1 if any fields in this segment have omitTf set to false; else, it's 0. CommitUserData stores an optional user-supplied opaque Map<String,String> that was passed to IndexWriter's commit or prepareCommit, or IndexReader's flush methods. The Diagnostics Map is privately written by IndexWriter, as a debugging aid, for each segment it creates. It includes metadata like the current Lucene version, OS, Java version, why the segment was created (merge, flush, addIndexes), etc. HasVectors is 1 if this segment stores term vectors, else it's 0. </section> <section id="Lock File">Lock File The write lock, which is stored in the index directory by default, is named "write.lock". If the lock directory is different from the index directory then the write lock will be named "XXXX-write.lock" where XXXX is a unique prefix derived from the full path to the index directory. When this file is present, a writer is currently modifying the index (adding or removing documents). This lock file ensures that only one writer is modifying the index at a time. </section> <section id="Deletable File">Deletable File A writer dynamically computes the files that are deletable, instead, so no file is written. </section> <section id="Compound Files">Compound Files Starting with Lucene 1.4 the compound file format became default. 
This is simply a container for all files described in the next section (except for the .del file). Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount

FileCount --> VInt

DataOffset --> Long

FileName --> String

FileData --> raw file data

The raw file data is the data from the individual files named above.
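To make the layout above concrete, here is a minimal Python sketch that reads the directory of a compound file held in memory. It rests on several assumptions that this document does not state: that Long is an 8-byte value written high-order byte first, that each DataOffset is an absolute position from the start of the .cfs file, and that FileData blocks appear in directory order so that an entry's length is the distance to the next offset. The function names are hypothetical.

```python
import struct
from typing import List, Tuple


def _read_vint(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a VInt starting at pos; return (value, position after it)."""
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:
            return value, pos
        shift += 7


def read_cfs_directory(data: bytes) -> List[Tuple[str, int, int]]:
    """Return (file name, data offset, length) for each entry of a .cfs image."""
    count, pos = _read_vint(data, 0)  # FileCount
    entries = []
    for _ in range(count):
        # DataOffset: assumed to be a big-endian 64-bit integer.
        (offset,) = struct.unpack_from(">q", data, pos)
        pos += 8
        # FileName: a String (VInt byte length followed by UTF-8 bytes).
        length, pos = _read_vint(data, pos)
        name = data[pos:pos + length].decode("utf-8")
        pos += length
        entries.append((name, offset))

    result = []
    for i, (name, offset) in enumerate(entries):
        # Assumption: each file ends where the next one starts.
        end = entries[i + 1][1] if i + 1 < len(entries) else len(data)
        result.append((name, offset, end - offset))
    return result
```

With the directory in hand, the raw data of an entry is simply the slice data[offset:offset + length].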

Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When compound file is enabled, these shared files will be added into a single compound file (same format as above) but with the extension <tt>.cfx. </section> </section> <section id="Per-Segment Files">Per-Segment Files The remaining files are all per-segment, and are thus defined by suffix. <section id="Fields">Fields Field Info Field names are stored in the field info file, with suffix .fnm. FieldInfos (.fnm) --> FNMVersion,FieldsCount, <FieldName, FieldBits> FieldsCount FNMVersion, FieldsCount --> VInt FieldName --> String FieldBits --> Byte <ul> <li> The low-order bit is one for indexed fields, and zero for non-indexed fields. </li> <li> The second lowest-order bit is one for fields that have term vectors stored, and zero for fields without term vectors. </li> <li>If the third lowest-order bit is set (0x04), term positions are stored with the term vectors. <li>If the fourth lowest-order bit is set (0x08), term offsets are stored with the term vectors. <li>If the fifth lowest-order bit is set (0x10), norms are omitted for the indexed field. <li>If the sixth lowest-order bit is set (0x20), payloads are stored for the indexed field. </ul> FNMVersion (added in 2.9) is always -2. Fields are numbered by their order in this file. Thus field zero is the first field in the file, field one the next, and so on. Note that, like document numbers, field numbers are segment relative. Stored Fields Stored fields are represented by two files: <ol> <li> The field index, or .fdx file. This contains, for each document, a pointer to its field data, as follows: FieldIndex (.fdx) --> <FieldValuesPosition> SegSize FieldValuesPosition --> Uint64 This is used to find the location within the field data file of the fields of a particular document. Because it contains fixed-length data, this file may be easily randomly accessed. 
The position of document n 's field data is the Uint64 at n*8 in this file. </li> <li> The field data, or .fdt file. This contains the stored fields of each document, as follows: FieldData (.fdt) --> <DocFieldData> SegSize DocFieldData --> FieldCount, <FieldNum, Bits, Value> FieldCount FieldCount --> VInt FieldNum --> VInt Bits --> Byte <ul> <li>low order bit is one for tokenized fields <li>second bit is one for fields containing binary data <li>third bit is one for fields with compression option enabled (if compression is enabled, the algorithm used is ZLIB), only available for indexes until Lucene version 2.9.x</li> <li>4th to 6th bit (mask: 0x7<<3) define the type of a numeric field: <ul> <li>all bits in mask are cleared if no numeric field at all <li>1<<3: Value is Int <li>2<<3: Value is Long <li>3<<3: Value is Int as Float (as of Float.intBitsToFloat) <li>4<<3: Value is Long as Double (as of Double.longBitsToDouble) </ul> </ul> Value --> String | BinaryValue | Int | Long (depending on Bits) BinaryValue --> ValueSize, <Byte>^ValueSize ValueSize --> VInt </li> </ol> </section> <section id="Term Dictionary">Term Dictionary The term dictionary is represented as two files: <ol> <li> The term infos, or tis file. TermInfoFile (.tis)--> TIVersion, TermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermInfos TIVersion --> UInt32 TermCount --> UInt64 IndexInterval --> UInt32 SkipInterval --> UInt32 MaxSkipLevels --> UInt32 TermInfos --> <TermInfo> TermCount TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta> Term --> <PrefixLength, Suffix, FieldNum> Suffix --> String PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text. TIVersion names the version of the format of this file and is equal to TermInfosWriter.FORMAT_CURRENT. 
Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y". FieldNum determines the term's field, whose name is stored in the .fnm file. DocFreq is the count of documents which contain the term. FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file). For fields with omitTf true, this will be 0 since prox information is not stored. SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreqs data. SkipDelta is only stored if DocFreq is not smaller than SkipInterval. </li> <li> The term info index, or .tii file. This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file. The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta. 
TermInfoIndex (.tii)--> TIVersion, IndexTermCount, IndexInterval, SkipInterval, MaxSkipLevels, TermIndices TIVersion --> UInt32 IndexTermCount --> UInt64 IndexInterval --> UInt32 SkipInterval --> UInt32 TermIndices --> <TermInfo, IndexDelta>^IndexTermCount IndexDelta --> VLong IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry. SkipInterval is the fraction of TermDocs stored in skip tables. It is used to accelerate TermDocs.skipTo(int). Larger values result in smaller indexes, greater acceleration, but fewer accelerable cases, while smaller values result in bigger indexes, less acceleration (in case of a small value for MaxSkipLevels) and more accelerable cases. MaxSkipLevels is the maximum number of skip levels stored for each term in the .frq file. A low value results in smaller indexes but less acceleration; a larger value results in slightly larger indexes but greater acceleration. See format of .frq file for more information about skip levels. </li> </ol> </section> <section id="Frequencies">Frequencies The .frq file contains the lists of documents which contain each term, along with the frequency of the term in that document (if omitTf is false). FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount TermFreqs --> <TermFreq>^DocFreq TermFreq --> DocDelta[, Freq?] SkipData --> <<SkipLevelLength, SkipLevel>^NumSkipLevels-1, SkipLevel> <SkipDatum> SkipLevel --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1)) SkipDatum --> DocSkip,PayloadLength?,FreqSkip,ProxSkip,SkipChildLevelPointer? DocDelta,Freq,DocSkip,PayloadLength,FreqSkip,ProxSkip --> VInt SkipChildLevelPointer --> VLong TermFreqs are ordered by term (the term is implicit, from the .tis file). TermFreq entries are ordered by increasing document number. DocDelta: if omitTf is false, this determines both the document number and the frequency. 
In particular, DocDelta/2 is the difference between this document number and the previous document number (or zero when this is the first document in a TermFreqs). When DocDelta is odd, the frequency is one. When DocDelta is even, the frequency is read as another VInt. If omitTf is true, DocDelta contains the gap (not multiplied by 2) between document numbers and no frequency information is stored. For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with omitTf false, would be the following sequence of VInts: 15, 8, 3 If omitTf were true it would be this sequence of VInts instead: 7, 4 DocSkip records the document number before every SkipInterval-th document in TermFreqs. If payloads are disabled for the term's field, then DocSkip represents the difference from the previous value in the sequence. If payloads are enabled for the term's field, then DocSkip/2 represents the difference from the previous value in the sequence. If payloads are enabled and DocSkip is odd, then PayloadLength is stored indicating the length of the last payload before the SkipInterval-th document in TermPositions. FreqSkip and ProxSkip record the position of every SkipInterval-th entry in FreqFile and ProxFile, respectively. The first file positions are relative to the start of TermFreqs and Positions; later ones are relative to the previous SkipDatum in the sequence. For example, if DocFreq=35 and SkipInterval=16, then there are two SkipData entries, containing the 15th and 31st document numbers in TermFreqs. The first FreqSkip names the number of bytes after the beginning of TermFreqs that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts. The first ProxSkip names the number of bytes after the beginning of Positions that the 16th SkipDatum starts, and the second the number of bytes after that that the 32nd starts. Each term can have multiple skip levels. 
The number of skip levels for a term is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq)/log(SkipInterval))). The number of SkipData entries for a skip level is DocFreq/(SkipInterval^(Level + 1)), where the lowest skip level is Level=0. 
Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs.
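The arithmetic behind this example can be sketched in Python as follows. The sketch reads the level count as floor(log(DocFreq)/log(SkipInterval)) capped at MaxSkipLevels, which is the reading consistent with the example, and computes it with integer division rather than floating-point logarithms. The function name skip_levels is hypothetical, and the returned ordinals use the same numbering as the example (3rd, 7th, ...).

```python
from typing import List


def skip_levels(doc_freq: int, skip_interval: int, max_skip_levels: int) -> List[List[int]]:
    """For each skip level, list the document ordinals its entries refer to."""
    if skip_interval < 2:
        raise ValueError("skip_interval must be at least 2")

    # Integer equivalent of floor(log(doc_freq) / log(skip_interval)):
    # count how many times doc_freq can be divided by skip_interval.
    levels = 0
    remaining = doc_freq
    while remaining >= skip_interval:
        remaining //= skip_interval
        levels += 1
    levels = min(max_skip_levels, levels)

    result = []
    for level in range(levels):
        step = skip_interval ** (level + 1)
        count = doc_freq // step  # DocFreq / SkipInterval^(Level + 1)
        result.append([k * step - 1 for k in range(1, count + 1)])
    return result
```

Calling skip_levels(35, 4, 2) gives eight ordinals for level 0 and two for level 1, as in the example; skip_levels(35, 16, 10) gives a single level holding the 15th and 31st entries.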

The SkipData entries on all upper levels (Level > 0) contain a SkipChildLevelPointer referencing the corresponding SkipData entry on the next lower level (Level-1). In the example, entry 15 on level 1 has a pointer to entry 15 on level 0, and entry 31 on level 1 has a pointer to entry 31 on level 0. </section> <section id="Positions">Positions The .prx file contains the lists of positions at which each term occurs within documents. Note that fields with omitTf true do not store anything in this file, and if all fields in the index have omitTf true then the .prx file will not exist. ProxFile (.prx) --> <TermPositions>^TermCount TermPositions --> <Positions>^DocFreq Positions --> <PositionDelta,Payload?>^Freq Payload --> <PayloadLength?,PayloadData> PositionDelta --> VInt PayloadLength --> VInt PayloadData --> <Byte>^PayloadLength TermPositions are ordered by term (the term is implicit, from the .tis file). Positions entries are ordered by increasing document number (the document number is implicit from the .frq file). PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document). If payloads are enabled for the term's field, then PositionDelta/2 is the difference between the current and the previous position. If payloads are enabled and PositionDelta is odd, then PayloadLength is stored, indicating the length of the payload at the current term position. For example, the TermPositions for a term which occurs as the fourth term in one document, and as the fifth and ninth term in a subsequent document, would be the following sequence of VInts (payloads disabled): 4, 5, 4 PayloadData is metadata associated with the current term position. If PayloadLength is stored at the current position, then it indicates the length of this Payload. If PayloadLength is not stored, then this Payload has the same length as the Payload at the previous position. 
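The delta encodings used by the .frq file (DocDelta, Freq) and by this file (PositionDelta) can be sketched together in Python. This is only an illustration of the rules stated in the text, with payloads disabled; the function names are hypothetical, and the results are lists of integers that would each be written as a VInt.

```python
from typing import List, Sequence, Tuple


def encode_term_freqs(postings: Sequence[Tuple[int, int]], omit_tf: bool = False) -> List[int]:
    """Encode (document number, frequency) pairs as the VInt values of TermFreqs."""
    out = []
    previous_doc = 0
    for doc, freq in postings:
        delta = doc - previous_doc
        previous_doc = doc
        if omit_tf:
            out.append(delta)  # plain gap, no frequency stored
        elif freq == 1:
            out.append((delta << 1) | 1)  # odd DocDelta: frequency is one
        else:
            out.append(delta << 1)  # even DocDelta: frequency follows
            out.append(freq)
    return out


def encode_positions(docs: Sequence[Sequence[int]]) -> List[int]:
    """Encode per-document position lists as PositionDelta values (no payloads)."""
    out = []
    for positions in docs:
        previous = 0  # deltas restart at zero for every document
        for position in positions:
            out.append(position - previous)
            previous = position
    return out
```

For the examples in the text, encode_term_freqs([(7, 1), (11, 3)]) yields 15, 8, 3 (or 7, 4 with omit_tf=True), and encode_positions([[4], [5, 9]]) yields 4, 5, 4.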
</section> <section id="Normalization Factors">Normalization Factors There's a single .nrm file containing all norms: AllNorms (.nrm) --> NormsHeader,<Norms>^NumFieldsWithNorms Norms --> <Byte>^SegSize NormsHeader --> 'N','R','M',Version Version --> Byte NormsHeader has 4 bytes, the last of which is the format version for this file, currently -1. Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent. These are converted to an IEEE single float value as follows: <ol> <li> If the byte is zero, use a zero float. </li> <li> Otherwise, set the sign bit of the float to zero; </li> <li> add 48 to the exponent and use this as the float's exponent; </li> <li> map the mantissa to the high-order 3 bits of the float's mantissa; and </li> <li> set the low-order 21 bits of the float's mantissa to zero. </li> </ol> A separate norm file is created when the norm values of an existing segment are modified. When field N is modified, a separate norm file .sN is created, to maintain the norm values for that field. Separate norm files are created (when adequate) for both compound and non compound segments. </section> <section id="Term Vectors">Term Vectors Term Vector support is optional on a field by field basis. It consists of 3 files. <ol> <li> The Document Index or .tvx file.

For each document, this stores the offset into the document data (.tvd) and field data (.tvf) files. DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition,FieldPosition>^NumDocs

TVXVersion --> Int (TermVectorsReader.CURRENT)

DocumentPosition --> UInt64 (offset in the .tvd file) FieldPosition --> UInt64 (offset in the .tvf file) </li> <li> The Document or .tvd file.

This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file. Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs

TVDVersion --> Int (TermVectorsReader.FORMAT_CURRENT)

NumFields --> VInt

FieldNums --> <FieldNumDelta>^NumFields

FieldNumDelta --> VInt

FieldPositions --> <FieldPositionDelta>^NumFields-1

FieldPositionDelta --> VLong

The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file. </li> <li> The Field or .tvf file.

This file contains, for each field that has a term vector stored, a list of the terms, their frequencies and, optionally, position and offset information. Field (.tvf) --> TVFVersion, <NumTerms, Position/Offset, TermFreqs>^NumFields

TVFVersion --> Int (TermVectorsReader.FORMAT_CURRENT)

NumTerms --> VInt

Position/Offset --> Byte

TermFreqs --> <TermText, TermFreq, Positions?, Offsets?>^NumTerms

TermText --> <PrefixLength, Suffix>

PrefixLength --> VInt

Suffix --> String

TermFreq --> VInt

Positions --> <VInt>^TermFreq

Offsets --> <VInt, VInt>^TermFreq

Notes:

<ul> <li>Position/Offset byte stores whether this term vector has position or offset information stored. </li> <li>Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y". </li> <li>Positions are stored as delta encoded VInts. This means we only store the difference of the current position from the last position. </li> <li>Offsets are stored as delta encoded VInts. The first VInt is the startOffset, the second is the endOffset. </li> </ul> </li> </ol> </section> <section id="Deleted Documents">Deleted Documents The .del file is optional, and only exists when a segment contains deletions. Although per-segment, this file is maintained exterior to compound segment files. Deletions (.del) --> [Format],ByteCount,BitCount, Bits | DGaps (depending on Format) Format,ByteCount,BitCount --> UInt32 Bits --> <Byte>^ByteCount DGaps --> <DGap,NonzeroByte>^NonzeroBytesCount DGap --> VInt NonzeroByte --> Byte Format is optional. -1 indicates DGaps. A non-negative value indicates Bits, and that Format is excluded. ByteCount indicates the number of bytes in Bits. It is typically (SegSize/8)+1. BitCount indicates the number of bits that are currently set in Bits. Bits contains one bit for each document indexed. When the bit corresponding to a document number is set, that document is marked as deleted. Bit ordering is from least to most significant. Thus, if Bits contains two bytes, 0x00 and 0x02, then document 9 is marked as deleted. DGaps represents sparse bit-vectors more efficiently than Bits. It is made of DGaps on indexes of nonzero bytes in Bits, and the nonzero bytes themselves. The number of nonzero bytes in Bits (NonzeroBytesCount) is not stored. 
For example, if there are 8000 bits and only bits 10,12,32 are set, DGaps would be used: (VInt) 1 , (byte) 20 , (VInt) 3 , (Byte) 1 </section> </section> <section id="Limitations">Limitations When referring to term numbers, Lucene's current implementation uses a Java <code>int to hold the term index, which means the maximum number of unique terms in any single index segment is ~2.1 billion times the term index interval (default 128) = ~274 billion. This is technically not a limitation of the index file format, just of Lucene's current implementation. Similarly, Lucene uses a Java <code>int to refer to document numbers, and the index file format uses an <code>Int32 on-disk to store document numbers. This is a limitation of both the index file format and the current implementation. Eventually these should be replaced with either <code>UInt64 values, or better yet, <code>VInt values which have no limit. </section> </body> </document>
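As a worked illustration of the Bits and DGaps encodings from the Deleted Documents section, the Python sketch below builds the deletion bit vector and then derives its DGaps form. The function names are hypothetical, and the sketch assumes that each DGap is the difference between the index of a nonzero byte and the index of the previous nonzero byte (starting from zero), which is the reading that reproduces the example in the text.

```python
from typing import Iterable, List, Tuple


def make_bits(byte_count: int, deleted_docs: Iterable[int]) -> bytes:
    """Build Bits: one bit per document, least significant bit first."""
    bits = bytearray(byte_count)
    for doc in deleted_docs:
        bits[doc >> 3] |= 1 << (doc & 7)
    return bytes(bits)


def encode_dgaps(bits: bytes) -> List[Tuple[int, int]]:
    """Return (DGap, NonzeroByte) pairs for every nonzero byte in Bits."""
    pairs = []
    last_index = 0
    for index, value in enumerate(bits):
        if value != 0:
            pairs.append((index - last_index, value))  # gap since last nonzero byte
            last_index = index
    return pairs
```

For 8000 bits with bits 10, 12 and 32 set, encode_dgaps(make_bits(1001, [10, 12, 32])) gives the pairs (1, 20) and (3, 1), matching the sequence (VInt) 1, (Byte) 20, (VInt) 3, (Byte) 1.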


Lucene example source code file (fileformats.xml)

This example Lucene source code file (fileformats.xml) is included in the DevDaily.com "Java Source Code Warehouse" project. The intent of this project is to help you "Learn Java by Example" ^TM.


The Lucene fileformats.xml source code

<?xml version="1.0"?>

<document>
    <header>
        <title>
            Apache Lucene - Index File Formats
        </title>
    </header>

    <body>
        <section id="Index File Formats">Index File Formats

            <p>
                This document defines the index file formats used
                in this version of Lucene. If you are using a different
                version of Lucene, please consult the copy of
                <code>docs/fileformats.html</code>
                that was distributed
                with the version you are using.
            </p>

            <p>
                Apache Lucene is written in Java, but several
                efforts are underway to write
                <a href="http://wiki.apache.org/lucene-java/LuceneImplementations">versions
                    of Lucene in other programming
                languages</a>.  If these versions are to remain compatible with Apache
                Lucene, then a language-independent definition of the Lucene index
                format is required.  This document thus attempts to provide a
                complete and independent definition of the Apache Lucene file
                formats.
            </p>

            <p>
                As Lucene evolves, this document should evolve.
                Versions of Lucene in different programming languages should endeavor
                to agree on file formats, and generate new versions of this document.
            </p>

            <p>
                Compatibility notes are provided in this document,
                describing how file formats have changed from prior versions.
            </p>

            <p>
                In version 2.1, the file format was changed to allow
                lock-less commits (i.e., no more commit lock). The
                change is fully backwards compatible: you can open a
                pre-2.1 index for searching or adding/deleting of
                docs. When the new segments file is saved
                (committed), it will be written in the new file format
                (meaning no specific "upgrade" process is needed).
                But note that once a commit has occurred, pre-2.1
                Lucene will not be able to read the index.
            </p>

            <p>
                In version 2.3, the file format was changed to allow
		segments to share a single set of doc store (vectors &amp;
		stored fields) files.  This allows for faster indexing
		in certain cases.  The change is fully backwards
		compatible (in the same way as the lock-less commits
		change in 2.1).
            </p>

            <p>
	        In version 2.4, Strings are now written as true UTF-8
	        byte sequences, not Java's modified UTF-8.  See issue
	        LUCENE-510 for details.
            </p>

	    <p>
	        In version 2.9, an optional opaque Map&lt;String,String&gt;
	        CommitUserData may be passed to IndexWriter's commit
	        methods (and later retrieved), which is recorded in
	        the segments_N file.  See issue LUCENE-1382 for
	        details.  Also, diagnostics were added to each segment
	        written recording details about why it was written
	        (due to flush, merge; which OS/JRE was used; etc.).
	        See issue LUCENE-1654 for details.
            </p>
	    
	    <p>
	        In version 3.0, compressed fields are no longer
	        written to the index (they can still be read, but on
	        merge the new segment will write them
	        uncompressed). See issue LUCENE-1960 for details.
            </p>

        <p>
            In version 3.1, segments record the code version
            that created them. See LUCENE-2720 for details.
            
            Additionally, segments explicitly track whether or
            not they have term vectors. See LUCENE-2811 for details.
           </p>
        <p>
            In version 3.2, numeric fields are written natively
            to the stored fields file; previously they were stored in
            text format only.
           </p>
        </section>

        <section id="Definitions">Definitions

            <p>
                The fundamental concepts in Lucene are index,
                document, field and term.
            </p>


            <p>
                An index contains a sequence of documents.
            </p>

            <ul>
                <li>
                    <p>
                        A document is a sequence of fields.
                    </p>
                </li>

                <li>
                    <p>
                        A field is a named sequence of terms.
                    </p>
                </li>

                <li>
                    A term is a string.
                </li>
            </ul>

            <p>
                The same string in two different fields is
                considered a different term.  Thus terms are represented as a pair of
                strings, the first naming the field, and the second naming text
                within the field.
            </p>

            <section id="Inverted Indexing">Inverted Indexing

                <p>
                    The index stores statistics about terms in order
                    to make term-based search more efficient.  Lucene's
                    index falls into the family of indexes known as an <i>inverted
                        index.</i> This is because it can list, for a term, the documents that contain
                    it.  This is the inverse of the natural relationship, in which
                    documents list terms.
                </p>
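                <p>
                    The idea can be illustrated with a minimal sketch. It is not
                    Lucene's implementation or on-disk layout: the function name
                    build_inverted_index, the whitespace tokenizer and the sample
                    documents are all assumptions made for illustration. Terms are
                    modelled as (field, text) pairs, as defined above.
                </p>

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each (field, text) term to the sorted document numbers containing it.

    `documents` is a list of dicts (field name -> field text); the position
    of a document in the list plays the role of its document number.
    """
    postings = defaultdict(set)
    for doc_number, fields in enumerate(documents):
        for field_name, text in fields.items():
            # Hypothetical tokenizer: lowercase, then split on whitespace.
            for token in text.lower().split():
                postings[(field_name, token)].add(doc_number)
    return {term: sorted(docs) for term, docs in postings.items()}


# Illustrative documents only; the natural relationship is document -> terms.
documents = [
    {"title": "index file formats", "body": "lucene stores an inverted index"},
    {"title": "primitive types", "body": "files are sequences of bytes"},
    {"title": "inverted index", "body": "terms list documents"},
]

# The inverted relationship: term -> documents.
index = build_inverted_index(documents)
print(index[("title", "index")])
```

                <p>
                    Note that ("title", "index") and ("body", "index") are separate
                    keys, since the same string in two different fields is a
                    different term.
                </p>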
            </section>
            <section id="Types of Fields">
                <title>Types of Fields</title>
                <p>
                    In Lucene, fields may be <i>stored</i>, in which
                    case their text is stored in the index literally, in a non-inverted
                    manner.  Fields that are inverted are called <i>indexed</i>. A field
                    may be both stored and indexed.</p>

                <p>The text of a field may be tokenized into terms to be
                    indexed, or the text of a field may be used literally as a term to be indexed.
                    Most fields are
                    tokenized, but sometimes it is useful for certain identifier fields
                    to be indexed literally.
                </p>
                <p>See the Field Javadocs for more information on Fields.</p>
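                <p>
                    As a rough sketch of the difference between tokenized and
                    literally indexed fields: the helper terms_for_field and its
                    lowercase, whitespace-splitting tokenizer are hypothetical and
                    stand in for whatever analysis an application configures.
                </p>

```python
def terms_for_field(field_name, text, tokenized=True):
    """Return the (field, text) terms that would be indexed for one field.

    Assumption: tokenizing means lowercasing and splitting on whitespace.
    Real analyzers are configurable and usually do considerably more.
    """
    if tokenized:
        return [(field_name, token) for token in text.lower().split()]
    # Indexed literally: the whole field text becomes a single term.
    return [(field_name, text)]


print(terms_for_field("title", "Index File Formats"))
print(terms_for_field("id", "DOC 42-A", tokenized=False))
```

                <p>
                    The identifier keeps its case and embedded space because it is
                    never split, which is why literal indexing suits identifier
                    fields.
                </p>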
            </section>

            <section id="Segments">Segments

                <p>
                    Lucene indexes may be composed of multiple sub-indexes, or
                    <i>segments</i>. Each segment is a fully independent index, which could be searched
                    separately. Indexes evolve by:
                </p>

                <ol>
                    <li>
                        <p>Creating new segments for newly added documents.</p>
                    </li>
                    <li>
                        <p>Merging existing segments.</p>
                    </li>
                </ol>

                <p>
                    Searches may involve multiple segments and/or multiple indexes, each
                    index potentially composed of a set of segments.
                </p>
            </section>

            <section id="Document Numbers">Document Numbers

                <p>
                    Internally, Lucene refers to documents by an integer <i>document
                        number</i>. The first document added to an index is numbered zero, and each
                    subsequent document added gets a number one greater than the previous.
                </p>

                <p>
                    <br/>
                </p>

                <p>
                    Note that a document's number may change, so caution should be taken
                    when storing these numbers outside of Lucene. In particular, numbers may
                    change in the following situations:
                </p>


                <ul>
                    <li>
                        <p>
                            The
                            numbers stored in each segment are unique only within the segment,
                            and must be converted before they can be used in a larger context.
                            The standard technique is to allocate each segment a range of
                            values, based on the range of numbers used in that segment.  To
                            convert a document number from a segment to an external value, the
                            segment's <i>base document
                            number</i> is added.  To convert an external value back to a
                            segment-specific value, the segment is identified by the range that
                            the external value is in, and the segment's base value is
                            subtracted.  For example, two five-document segments might be
                            combined, so that the first segment has a base value of zero, and
                            the second of five.  Document three from the second segment would
                            have an external value of eight.
                        </p>
                    </li>
                    <li>
                        <p>
                            When documents are deleted, gaps are created
                            in the numbering. These are eventually removed as the index evolves
                            through merging. Deleted documents are dropped when segments are
                            merged. A freshly-merged segment thus has no gaps in its numbering.
                        </p>
                    </li>
                </ul>
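                <p>
                    Both situations can be sketched in a few lines. The function
                    names (segment_bases, to_external, to_segment,
                    renumber_after_merge) are hypothetical and are not Lucene APIs;
                    the sketch only assumes that each segment's base is the total
                    number of documents in the segments before it, and that merging
                    keeps the surviving documents in their original order.
                </p>

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Give each segment a base equal to the document count of all earlier segments."""
    bases, next_base = [], 0
    for size in segment_sizes:
        bases.append(next_base)
        next_base += size
    return bases


def to_external(bases, segment, local_doc):
    """Segment-specific number -> external number: add the segment's base."""
    return bases[segment] + local_doc


def to_segment(bases, external_doc):
    """External number -> (segment, segment-specific number).

    The segment is the last one whose base is not greater than the
    external value; its base is then subtracted.
    """
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]


def renumber_after_merge(deleted, doc_count):
    """Map old numbers to new ones once deleted documents are dropped."""
    mapping = {}
    for old in range(doc_count):
        if old not in deleted:
            mapping[old] = len(mapping)  # next unused number, so no gaps remain
    return mapping


# Two five-document segments, as in the example above.
bases = segment_bases([5, 5])
print(to_external(bases, 1, 3))
print(to_segment(bases, 8))
print(renumber_after_merge({1, 3}, 5))
```

                <p>
                    The renumbering step is the reason document numbers should not
                    be treated as stable identifiers outside of Lucene.
                </p>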

            </section>

        </section>

        <section id="Overview">Overview

            <p>
                Each segment index maintains the following:
            </p>
            <ul>
                <li>
                    <p>Field names. This
                        contains the set of field names used in the index.

                    </p>
                </li>
                <li>
                    <p>Stored Field
                        values. This contains, for each document, a list of attribute-value
                        pairs, where the attributes are field names. These are used to
                        store auxiliary information about the document, such as its title,
                        url, or an identifier to access a
                        database. The set of stored fields is what is returned for each hit
                        when searching. This is keyed by document number.
                    </p>
                </li>
                <li>
                    <p>Term dictionary.
                        A dictionary containing all of the terms used in all of the indexed
                        fields of all of the documents. The dictionary also contains the
                        number of documents which contain the term, and pointers to the
                        term's frequency and proximity data.
                    </p>
                </li>

                <li>
                    <p>Term Frequency
                        data. For each term in the dictionary, the numbers of all the
                        documents that contain that term, and the frequency of the term in
                        that document if omitTf is false.
                    </p>
                </li>

                <li>
                    <p>Term Proximity
                        data. For each term in the dictionary, the positions at which the term
                        occurs in each document.  Note that this will
                        not exist if all fields in all documents set
                        omitTf to true.
                    </p>
                </li>

                <li>
                    <p>Normalization
                        factors. For each field in each document, a value is stored that is
                        multiplied into the score for hits on that field.
                    </p>
                </li>
                <li>
                    <p>Term Vectors. For each field in each document, the term vector
                        (sometimes called document vector) may be stored. A term vector consists
                        of term text and term frequency. To add Term Vectors to your index see the
                        <a href="api/core/org/apache/lucene/document/Field.html">Field
                        constructors</a>.
                    </p>
                </li>
                <li>
                    <p>Deleted documents.
                        An optional file indicating which documents are deleted.
                    </p>
                </li>
            </ul>

            <p>Details on each of these are provided in subsequent sections.
            </p>
        </section>

        <section id="File Naming">File Naming

            <p>
                All files belonging to a segment have the same name with varying
                extensions. The extensions correspond to the different file formats
                described below. When using the Compound File format (default in 1.4 and greater), these files are
                collapsed into a single .cfs file (see below for details).
            </p>

            <p>
                Typically, all segments
                in an index are stored in a single directory, although this is not
                required.
            </p>

            <p>
                As of version 2.1 (lock-less commits), file names are
                never re-used (there is one exception, "segments.gen",
                see below). That is, when any file is saved to the
                Directory it is given a never before used filename.
                This is achieved using a simple generations approach.
                For example, the first segments file is segments_1,
                then segments_2, etc. The generation is a sequential
                long integer represented in alphanumeric (base 36)
                form.
            </p>
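            <p>
                A small sketch of how such names could be derived from a
                generation number. The helpers to_base36 and segments_file_name
                are hypothetical, and the sketch assumes lowercase letters are
                used for the digits above nine and that generations are positive.
            </p>

```python
import string

# The 36 digits of base 36: 0-9 followed by a-z (lowercase is an assumption).
DIGITS = string.digits + string.ascii_lowercase


def to_base36(n):
    """Render a non-negative integer in base 36."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def segments_file_name(generation):
    """Build the segments file name for a given generation."""
    return "segments_" + to_base36(generation)


for generation in (1, 2, 10, 35, 36):
    print(generation, segments_file_name(generation))
```

            <p>
                Because each commit uses the next generation, every segments file
                name is new, which is what allows file names never to be re-used.
            </p>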

        </section>
      <section id="file-names">Summary of File Extensions
        <p>The following table summarizes the names and extensions of the files in Lucene:
          <table>
            <tr>
              <th>Name</th>
              <th>Extension</th>
              <th>Brief Description</th>
            </tr>
            <tr>
              <td>Segments File

Copyright 1998-2021 Alvin Alexander, alvinalexander.com
All Rights Reserved.
