Text Indexes

Text indexes combine an inverted index and a lexicon to support text indexing and searching. A text index can be created without passing any arguments:

>>> from zope.index.text.textindex import TextIndex
>>> index = TextIndex()

By default, it uses an “Okapi” inverted index and a lexicon with a pipeline consisting of a simple word splitter, a case normalizer, and a stop-word remover.
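The same defaults can be spelled out explicitly when a different pipeline or ranking is wanted. The sketch below assumes the TextIndex constructor accepts lexicon and index arguments and that the pipeline elements live in zope.index.text.lexicon; check this against the zope.index release you are using:

    from zope.index.text.lexicon import Lexicon
    from zope.index.text.lexicon import Splitter, CaseNormalizer, StopWordRemover
    from zope.index.text.okapiindex import OkapiIndex
    from zope.index.text.textindex import TextIndex

    # Reproduce the default configuration by hand.
    lexicon = Lexicon(Splitter(), CaseNormalizer(), StopWordRemover())
    index = TextIndex(lexicon=lexicon, index=OkapiIndex(lexicon))

Passing a CosineIndex (see the reference section below) in place of the OkapiIndex switches the ranking to a cosine measure.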

We index text using the index_doc method:

>>> index.index_doc(1, u"the quick brown fox jumps over the lazy dog")
>>> index.index_doc(2,
...    u"the brown fox and the yellow fox don't need the retriever")
>>> index.index_doc(3, u"""
... The Conservation Pledge
... =======================
...
... I give my pledge, as an American, to save, and faithfully
... to defend from waste, the natural resources of my Country;
... it's soils, minerals, forests, waters and wildlife.
... """)
>>> index.index_doc(4, u"Fran\xe7ois")
>>> word = (
...     u"\N{GREEK SMALL LETTER DELTA}"
...     u"\N{GREEK SMALL LETTER EPSILON}"
...     u"\N{GREEK SMALL LETTER LAMDA}"
...     u"\N{GREEK SMALL LETTER TAU}"
...     u"\N{GREEK SMALL LETTER ALPHA}"
...     )
>>> index.index_doc(5, word + u"\N{EM DASH}\N{GREEK SMALL LETTER ALPHA}")
>>> index.index_doc(6, u"""
... What we have here, is a failure to communicate.
... """)
>>> index.index_doc(7, u"""
... Hold on to your butts!
... """)
>>> index.index_doc(8, u"""
... The Zen of Python, by Tim Peters
...
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """)

Then we can search using the apply method, which takes a search string.

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown fox').items()]
[(1, '0.6153'), (2, '0.6734')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'quick fox').items()]
[(1, '0.6153')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown python').items()]
[]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'dalmatian').items()]
[]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'brown or python').items()]
[(1, '0.2602'), (2, '0.2529'), (8, '0.0934')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(u'butts').items()]
[(7, '0.6948')]

The outputs are mappings from document ids to float scores. Items with higher scores are more relevant.

We can use unicode characters in search strings.

>>> [(k, "%.4f" % v) for (k, v) in index.apply(u"Fran\xe7ois").items()]
[(4, '0.7427')]
>>> [(k, "%.4f" % v) for (k, v) in index.apply(word).items()]
[(5, '0.7179')]

We can use globbing in search strings.

>>> [(k, "%.3f" % v) for (k, v) in index.apply('fo*').items()]
[(1, '2.179'), (2, '2.651'), (3, '2.041')]

Text indexes support basic statistics:

>>> index.documentCount()
8
>>> index.wordCount()
114

If we index the same document twice, once with an empty value, and then with a normal value, it should still work:

>>> index2 = TextIndex()
>>> index2.index_doc(1, [])
>>> index2.index_doc(1, ["Zorro"])
>>> [(k, "%.4f" % v) for (k, v) in index2.apply("Zorro").items()]
[(1, '0.4545')]

Tracking Changes

If we index a document for the first time, it updates the _totaldoclen of the underlying index object.

>>> index = TextIndex()
>>> index.index._totaldoclen()
0
>>> index.index_doc(100, u"a new funky value")
>>> index.index._totaldoclen()
3

If we index it a second time, the underlying index length should not be changed.

>>> index.index_doc(100, u"a new funky value")
>>> index.index._totaldoclen()
3

But if we change it, the length changes too.

>>> index.index_doc(100, u"an even newer funky value")
>>> index.index._totaldoclen()
5

The same rules apply to unindex_doc: unindexing a document that is not indexed should not change the state of any index.

>>> index.unindex_doc(100)
>>> index.index._totaldoclen()
0
>>> index.unindex_doc(100)
>>> index.index._totaldoclen()
0

Reference

Interfaces

Text-indexing interfaces

Indexes

Text index.

Abstract base class for full text index with relevance ranking.

Cosine

class zope.index.text.cosineindex.CosineIndex(lexicon, family=None)

Full text index with relevance ranking, using a cosine measure.

Okapi

Full text index with relevance ranking, using an Okapi BM25 rank.

“Okapi” (much like “cosine rule” also) is a large family of scoring gimmicks. It’s based on probability arguments about how words are distributed in documents, not on an abstract vector space model. A long paper by its principal inventors gives an excellent overview of how it was derived:

A probabilistic model of information retrieval: development and status
K. Sparck Jones, S. Walker, S.E. Robertson
http://citeseer.nj.nec.com/jones98probabilistic.html

Spellings that ignore relevance information (which we don’t have) are of this high-level form:

score(D, Q) = sum(for t in D&Q: TF(D, t) * IDF(Q, t))

where

    D          a specific document
    Q          a specific query
    t          a term (word, atomic phrase, whatever)
    D&Q        the terms common to D and Q
    TF(D, t)   a measure of t's importance in D -- a kind of term
               frequency weight
    IDF(Q, t)  a measure of t's importance in the query and in the set
               of documents as a whole -- a kind of inverse document
               frequency weight

The IDF(Q, t) here is identical to the one used for our cosine measure. Since queries are expected to be short, it ignores Q entirely:

IDF(Q, t) = log(1.0 + N / f(t))

where

    N     the total number of documents
    f(t)  the number of documents in which t appears

Most Okapi literature seems to use log(N/f(t)) instead. We don’t, because that becomes 0 for a term that’s in every document, and, e.g., if someone is searching for “documentation” on python.org (a term that may well show up on every page, due to the top navigation bar), we still want to find the pages that use the word a lot (which is TF’s job to find, not IDF’s – we just want to stop IDF from considering this t to be irrelevant).
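As a quick illustration of why the added 1.0 matters, here is the IDF spelling above as plain Python; the function name is illustrative, not the library's internal code:

    import math

    def inverse_doc_frequency(doc_count_with_term, total_doc_count):
        # log(1.0 + N / f(t)): stays positive even when f(t) == N
        return math.log(1.0 + total_doc_count / doc_count_with_term)

    # A term present in all 1000 documents still gets a small positive weight
    # (about 0.69, i.e. log 2) instead of the 0 that log(N/f(t)) would give,
    # while a rare term (in 3 of 1000 documents) gets about 5.81.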

The TF(D, t) spellings are more interesting. With lots of variations, the most basic spelling is of the form:

               f(D, t)
TF(D, t) = ---------------
            f(D, t) + K(D)

where

    f(D, t)  the number of times t appears in D
    K(D)     a measure of the length of D, normalized to mean doc length

The functional form f/(f+K) is clever. It’s a gross approximation to a mixture of two distinct Poisson distributions, based on the idea that t probably appears in D for one of two reasons:

  1. More or less at random.

  2. Because it’s important to D’s purpose in life (“eliteness” in papers).

Note that f/(f+K) is always between 0 and 1. If f is very large compared to K, it approaches 1. If K is very large compared to f, it approaches 0. If t appears in D more or less “for random reasons”, f is likely to be small, and so K will dominate unless it’s a very small doc, and the ratio will be small. OTOH, if t appears a lot in D, f will dominate unless it’s a very large doc, and the ratio will be close to 1.
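A few spot values make the saturation easy to see; this is plain arithmetic on the formula above, not library code:

    K = 1.0  # K(D) for a document of average length
    for f in (1, 3, 10, 100):
        print(f, round(f / (f + K), 2))   # prints 0.5, 0.75, 0.91, 0.99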

We use a variation on that simple theme, a simplification of what’s called BM25 in the literature (it was the 25th stab at a Best Match function from the Okapi group; “a simplification” means we’re setting some of BM25’s more esoteric free parameters to 0):

            f(D, t) * (k1 + 1)
TF(D, t) = --------------------
            f(D, t) + k1 * K(D)

where

    k1  a “tuning factor”, typically between 1.0 and 2.0.  We use 1.2,
        the usual default value.  This constant adjusts the curve to
        look more like a theoretical 2-Poisson curve.

Note that as f(D, t) increases, TF(D, t) increases monotonically, approaching an asymptote of k1+1 from below.

Finally, we use:

K(D) = (1-b) + b * len(D)/E(len(D))

where

    b          another free parameter, discussed below.  We use 0.75.
    len(D)     the length of D in words
    E(len(D))  the expected value of len(D) across the whole document
               set; or, IOW, the average document length

b is a free parameter between 0.0 and 1.0, and adjusts for the expected effect of the “Verbosity Hypothesis”. Suppose b is 1, and some word t appears 10 times as often in document d2 as in document d1. If document d2 is also 10 times as long as d1, TF(d1, t) and TF(d2, t) are identical:

                  f(d2, t) * (k1 + 1)
TF(d2, t) = --------------------------------- =
             f(d2, t) + k1 * len(d2)/E(len(D))

                         10 * f(d1, t) * (k1 + 1)
            ----------------------------------------------- = TF(d1, t)
             10 * f(d1, t) + k1 * (10 * len(d1))/E(len(D))

because the 10’s cancel out. This is appropriate if we believe that a word appearing 10x more often in a doc 10x as long is simply because the longer doc is more verbose. If we do believe that, the longer doc and the shorter doc are probably equally relevant. OTOH, it could be that the longer doc is talking about t in greater depth too, in which case it’s probably more relevant than the shorter doc.

At the other extreme, if we set b to 0, the len(D)/E(len(D)) term vanishes completely, and a doc scores higher for having more occurrences of a word regardless of the doc’s length.

Reality is between these extremes, and probably varies by document and word too. Reports in the literature suggest that b=0.75 is a good compromise “in general”, favoring the “verbosity hypothesis” end of the scale.

Putting it all together, the final TF function is:

                       f(D, t) * (k1 + 1)
TF(D, t) = --------------------------------------------
            f(D, t) + k1 * ((1-b) + b*len(D)/E(len(D)))

with k1=1.2 and b=0.75.
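Putting the pieces into code, the following sketch reproduces the final TF formula and the high-level score sum with the stated defaults; the function and variable names are illustrative and are not the attributes the OkapiIndex class actually uses:

    def okapi_tf(term_count, doc_len, mean_doc_len, k1=1.2, b=0.75):
        # K(D) = (1-b) + b * len(D)/E(len(D))
        K = (1.0 - b) + b * (doc_len / mean_doc_len)
        # TF(D, t) = f(D, t)*(k1 + 1) / (f(D, t) + k1*K(D))
        return term_count * (k1 + 1.0) / (term_count + k1 * K)

    def okapi_score(query_terms, doc_term_counts, doc_len, mean_doc_len, idf):
        # score(D, Q) = sum over the terms common to D and Q of TF * IDF
        return sum(
            okapi_tf(doc_term_counts[t], doc_len, mean_doc_len) * idf[t]
            for t in query_terms
            if t in doc_term_counts
        )

    # A document of average length mentioning a term 5 times:
    # okapi_tf(5, 100, 100) == 5 * 2.2 / (5 + 1.2), about 1.77, already most
    # of the way to the k1 + 1 = 2.2 asymptote.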

Query Term Weighting

I’m ignoring the query adjustment part of Okapi BM25 because I expect our queries are very short. Full BM25 takes them into account by adding the following to every score(D, Q); it depends on the lengths of D and Q, but not on the specific words in Q, or even on whether they appear in D(!):

               E(len(D)) - len(D)
k2 * len(Q) * -------------------
               E(len(D)) + len(D)

Here k2 is another “tuning constant”, len(Q) is the number of words in Q, and len(D) & E(len(D)) were defined above. The Okapi group set k2 to 0 in TREC-9, so it apparently doesn’t do much good (or may even hurt).

Full BM25 also multiplies the following factor into IDF(Q, t):

f(Q, t) * (k3 + 1)
------------------
   f(Q, t) + k3

where k3 is yet another free parameter, and f(Q,t) is the number of times t appears in Q. Since we’re using short “web style” queries, I expect f(Q,t) to always be 1, and then that quotient is:

1 * (k3 + 1)
------------ = 1
   1 + k3

regardless of k3’s value. So, in a trivial sense, we are incorporating this measure (and optimizing it by not bothering to multiply by 1 <wink>).
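For completeness, the two query-side corrections being dropped look like this in code; neither is computed by this index, per the reasoning above, and the names are illustrative only:

    def bm25_length_correction(len_Q, len_D, mean_len_D, k2=0.0):
        # Added to every score(D, Q); with the TREC-9 choice k2 = 0 it vanishes.
        return k2 * len_Q * (mean_len_D - len_D) / (mean_len_D + len_D)

    def bm25_query_term_factor(f_Q_t, k3):
        # Multiplied into IDF(Q, t); equals 1 whenever f(Q, t) == 1.
        return f_Q_t * (k3 + 1.0) / (f_Q_t + k3)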

Utilities

HTML Splitter

Lexicon

SetOps – Weighted intersections and unions applied to many inputs.

Provide a default list of stop words for the index.

The specific splitter and lexicon are customizable, but the default ZCTextIndex should do something useful.

Encodings

Rice coding (a variation of Golomb coding)

Based on a Java implementation by Glen McCluskey described in a Usenix article at http://www.usenix.org/publications/login/2000-4/features/java.html

McCluskey’s article explains the approach as follows. The encoding for a value x is represented as a unary part and a binary part. The unary part is a sequence of 1 bits followed by a 0 bit. The binary part encodes some of the lower bits of x-1.

The encoding is parameterized by a value m that describes how many bits to store in the binary part. If most of the values are smaller than 2**m then they can be stored in only m+1 bits.

  1. Compute the length of the unary part, q, where

         q = math.floor((x-1) / 2 ** m)

  2. Emit q 1 bits followed by a 0 bit.

  3. Emit the lower m bits of x-1, treating x-1 as a binary value.
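The steps translate directly into a short bit-string sketch. This is an illustration of the scheme, not the encode/decode API that zope.index.text.ricecode actually exposes:

    def rice_encode(x, m):
        # Encode a positive integer x as a string of '0'/'1' characters.
        assert x >= 1
        q = (x - 1) >> m                 # length of the unary part
        unary = '1' * q + '0'            # q one-bits terminated by a zero-bit
        low_bits = (x - 1) & ((1 << m) - 1)
        binary = format(low_bits, '0%db' % m)   # the lower m bits of x-1
        return unary + binary

    # With m = 3: rice_encode(1, 3) == '0000' and rice_encode(9, 3) == '10000';
    # every value smaller than 2**m fits in m + 1 bits, as noted above.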

Widcode

A byte-aligned encoding for lists of non-negative ints, using fewer bytes for smaller ints. This is intended for lists of word ids (wids). The ordinary string .find() method can be used to find the encoded form of a desired wid-string in an encoded wid-string. As in UTF-8, the initial byte of an encoding can’t appear in the interior of an encoding, so find() can’t be fooled into starting a match “in the middle” of an encoding. Unlike UTF-8, the initial byte does not tell you how many continuation bytes follow; and there’s no ASCII superset property.

Details:

  • Only the first byte of an encoding has the sign bit set.

  • The first byte has 7 bits of data.

  • Bytes beyond the first in an encoding have the sign bit clear, followed by 7 bits of data.

  • The first byte doesn’t tell you how many continuation bytes follow. You can tell by searching for the next byte with the high bit set (or the end of the string).

The int to be encoded can contain no more than 28 bits.

If it contains no more than 7 bits, 0abcdefg, the encoding is

1abcdefg

If it contains 8 thru 14 bits,

00abcdef ghijkLmn

the encoding is

1abcdefg 0hijkLmn

Static tables _encoding and _decoding capture all encodes and decodes for 14 or fewer bits.

If it contains 15 thru 21 bits,

000abcde fghijkLm nopqrstu

the encoding is

1abcdefg 0hijkLmn 0opqrstu

If it contains 22 thru 28 bits,

0000abcd efghijkL mnopqrst uvwxyzAB

the encoding is

1abcdefg 0hijkLmn 0opqrstu 0vwxyzAB
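A minimal sketch of the byte layout described above; it follows the stated rules (high bit set on the first byte only, 7 data bits per byte, at most 28 bits) but is not the module's actual _encoding/_decoding machinery:

    def wid_encode(wid):
        # Encode one non-negative int (< 2**28) using the byte layout above.
        assert 0 <= wid < (1 << 28)
        groups = [(wid >> shift) & 0x7F for shift in (21, 14, 7, 0)]
        while len(groups) > 1 and groups[0] == 0:
            groups.pop(0)                 # drop empty leading 7-bit groups
        first, rest = groups[0], groups[1:]
        # High (sign) bit set on the first byte only, clear on continuations.
        return bytes([0x80 | first] + rest)

    # wid_encode(5)      -> one byte,  0x85       (1abcdefg)
    # wid_encode(0x1234) -> two bytes, 0xA4 0x34  (1abcdefg 0hijkLmn)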

Query Parser

Query Parser.

This particular parser recognizes the following syntax:

Start = OrExpr
OrExpr = AndExpr ('OR' AndExpr)*
AndExpr = Term ('AND' NotExpr | 'NOT' AndExpr)*
NotExpr = ['NOT'] Term
Term = '(' OrExpr ')' | ATOM+

The key words (AND, OR, NOT) are recognized in any mixture of case.

An ATOM is either:

  • A sequence of characters not containing whitespace or parentheses or double quotes, and not equal (ignoring case) to one of the key words ‘AND’, ‘OR’, ‘NOT’; or

  • A non-empty string enclosed in double quotes. The interior of the string can contain whitespace, parentheses and key words, but not quotes.

  • A hyphen followed by one of the two forms above, meaning that it must not be present.

An unquoted ATOM may also contain globbing characters. Globbing syntax is defined by the lexicon; for example “foo*” could mean any word starting with “foo”.

When multiple consecutive ATOMs are found at the leaf level, they are connected by an implied AND operator, and an unquoted leading hyphen is interpreted as a NOT operator.

Summarizing the default operator rules:

  • a sequence of words without operators implies AND, e.g. foo bar

  • double-quoted text implies phrase search, e.g. "foo bar"

  • words connected by punctuation imply phrase search, e.g. foo-bar

  • a leading hyphen implies NOT, e.g. foo -bar

  • these can be combined, e.g. foo -"foo bar" or foo -foo-bar

  • ? and * are used for globbing (i.e. prefix search), e.g. foo*
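These operator forms can be handed straight to the apply method of the index built at the top of this document. The calls below are a usage sketch only; the exact scores they return depend on the indexed collection, so no outputs are shown:

    results = index.apply(u'brown AND fox')      # explicit AND (same as "brown fox")
    results = index.apply(u'brown OR python')    # either term may match
    results = index.apply(u'fox -retriever')     # leading hyphen excludes a term
    results = index.apply(u'"quick brown fox"')  # double quotes force a phrase search
    results = index.apply(u'fo*')                # globbing / prefix search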