How to build a search from scratch with Apache Lucene

The last two days I've been playing with Apache Lucene. And you can now search the forum.

Getting started

I've learned a few things about Lucene over the past couple of days. It wasn't easy to figure out how to get started with the framework, so I'm writing down my experience while the memory is still fresh. I started with this Unit Test demo in the official Lucene repository, and from there I read a lot of internal Lucene code. I have to say it is some of the most beautiful framework code I've read. It's written with a lot of care, a clear focus on performance, and great Javadoc explaining the reasons behind the code where necessary. Before you start coding, you need to add the following dependencies:
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.2.0</version>
</dependency>
Let's start by writing a first test. Since creating a Lucene index is time-consuming, it makes sense to use one index for the entire test class.
@Execution(ExecutionMode.SAME_THREAD)
class LuceneForumSearchGatewayTest {

    private static Path indexPath;
    private static LuceneForumSearchGateway searchGateway;

    @BeforeAll
    static void beforeAll() throws IOException {
        indexPath = Files.createTempDirectory("tempIndex");
        searchGateway = new LuceneForumSearchGateway(indexPath);
    }
}
We create a temporary directory where Lucene can create and store the index, and then we instantiate the class we want to test. Wait, that one does not exist yet, so we have a production class to create!
public class LuceneForumSearchGateway {

    private final Directory directory;
    private final Analyzer analyzer;
    private final IndexWriter indexWriter;
    private final SearcherManager searcherManager;

    public LuceneForumSearchGateway(Path indexPath) throws IOException {
        directory = FSDirectory.open(indexPath);
        analyzer = new StandardAnalyzer();
        indexWriter = new IndexWriter(directory, new IndexWriterConfig(analyzer));
        searcherManager = new SearcherManager(indexWriter, null);
    }
}
There are four instances created here. First, Lucene needs a directory; that's where the search index will be created. There are different implementations, which is why Directory is an abstract class. What we want is a file system directory. FSDirectory.open() is a utility method that creates the file system directory implementation best suited for the current platform. On a 64-bit Linux server this results in an MMapDirectory, a directory using memory-mapped files and one of the reasons why Lucene is so insanely fast.

The next instance is an Analyzer, which extracts token streams from text. What this basically does is take a text like "Ben and Jerry!" and extract the terms "Ben" and "Jerry" for the index. The word "and" is a stop word and thus removed, as is the "!". Luckily all texts on this forum are English, so we can simply create a StandardAnalyzer.

The IndexWriter is used to add documents to the index (in our case forum posts). It is quite expensive to open, but luckily it is thread-safe once opened, so I keep this one instance open during the entire lifetime of the application. And finally, there is the SearcherManager, a helper class to search the index from multiple threads without them interfering with each other or with the IndexWriter.

We created a lot of expensive stuff, so we need to dispose of it when we are done! Let's add this tear down method to the test:
@AfterAll
static void afterAll() throws IOException {
    searchGateway.dispose();
    IOUtils.rm(indexPath);
}
And implement the method in production code. Note that we are good citizens and close all the above instances in the reverse order we created them.
public void dispose() throws IOException {
    searcherManager.close();
    indexWriter.close();
    analyzer.close();
    directory.close();
}
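As an aside, the analysis step described above can be sketched in plain Java. This is a conceptual illustration only, not Lucene's actual pipeline (StandardAnalyzer is a chain of a tokenizer and token filters), and the stop word set here is a made-up miniature:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Conceptual sketch of what an analyzer does: lowercase the text, split it
// into tokens, and drop punctuation and stop words. This is NOT Lucene's
// actual implementation, just an illustration of the idea.
public class AnalyzerSketch {

    // A tiny stop word set for the example; the real analyzer uses a full English list.
    private static final Set<String> STOP_WORDS = Set.of("and", "or", "the", "a", "an");

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on anything that is not a letter or digit, dropping "!" and friends.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

With this sketch, "Ben and Jerry!" indeed yields the terms "ben" and "jerry" (Lucene's StandardAnalyzer also lowercases terms before indexing them).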

Adding documents

Now it's time to write the first real test! For the setup, we need some documents to search.
@Test
void search() throws IOException {
    searchGateway.add(new ForumSearchEntry(1, "Hello World"));
    searchGateway.add(new ForumSearchEntry(2, "We have a new tower"));
}
Lucene organizes its index with so-called documents. We don't want the outside world to know about this internal detail, so let's create a POJO for it; after all, we want to search forum posts!
public class ForumSearchEntry {
    public final long postId;
    public final String content;

    public ForumSearchEntry(long postId, String content) {
        this.postId = postId;
        this.content = content;
    }
}
Now we can implement the add method:
public void add(ForumSearchEntry entry) throws IOException {
    Document document = createDocument(entry);
    indexWriter.addDocument(document);
}

private Document createDocument(ForumSearchEntry entry) {
    Document document = new Document();

    document.add(new TextField("content", entry.content, Store.YES));
    document.add(new StoredField("postId", entry.postId));
    return document;
}
Quite straightforward so far! We create a document, add a TextField with the post content and a StoredField for the post identifier. Note that we need to store the content because we want to highlight the results later on. Okay, it is time to write our first assertion. Let's see how many documents we have in the index now.
@Test
void search() throws IOException {
    searchGateway.add(new ForumSearchEntry(1, "Hello World"));
    searchGateway.add(new ForumSearchEntry(2, "We have a new tower"));

    assertThat(searchGateway.getCount()).isEqualTo(2);
}
Let's add this missing getCount() method...
public int getCount() {
    try {
        IndexSearcher indexSearcher = searcherManager.acquire();
        try {
            return indexSearcher.getIndexReader().numDocs();
        } finally {
            searcherManager.release(indexSearcher);
        }
    } catch (IOException e) {
        logger.error("Failed to determine lucene document count.", e);
        return 0;
    }
}
Okay, let's explain what's going on here. Remember the searcherManager? It allows us to acquire an IndexSearcher in a thread-safe and still very fast way. We first acquire the searcher, then we use it, and we release it in the finally block when we are done with it (even if an exception is thrown). In this case we simply ask for the total document count in the index. Time to run that test! And... the test is red. Expected 2, got 0. So what's going on there? Lucene buffers added documents; they don't become visible in the index until you commit. So let's adjust our test:
@Test
void search() throws IOException {
    searchGateway.add(new ForumSearchEntry(1, "Hello World"));
    searchGateway.add(new ForumSearchEntry(2, "We have a new tower"));
    searchGateway.commit();

    assertThat(searchGateway.getCount()).isEqualTo(2);
}
And implement the commit method like so (we call commit on the IndexWriter and we also ask the searcherManager for a refresh):
public void commit() throws IOException {
    indexWriter.commit();
    searcherManager.maybeRefresh();
}
Now the test is green, we have two documents in the index!
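The acquire/use/release discipline we rely on here is, at its core, reference counting: a searcher stays alive as long as someone holds it, and a refresh publishes a fresh one without disturbing in-flight searches. Here is a minimal plain-Java sketch of that idea (this is my own illustration, not Lucene's actual SearcherManager):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the reference-counting idea behind SearcherManager.
// Each acquired "snapshot" stays pinned until released; refresh() publishes
// a new generation while old acquirers keep their snapshot. Not Lucene code.
public class RefCountedSketch {

    public static final class Snapshot {
        final int generation;
        final AtomicInteger refCount = new AtomicInteger(1); // 1 = held by the manager

        Snapshot(int generation) {
            this.generation = generation;
        }
    }

    private volatile Snapshot current = new Snapshot(0);

    // Like SearcherManager.acquire(): pin the current snapshot.
    public synchronized Snapshot acquire() {
        current.refCount.incrementAndGet();
        return current;
    }

    // Like SearcherManager.release(): unpin. When the count reaches zero,
    // a real implementation would close the underlying resources.
    public void release(Snapshot snapshot) {
        snapshot.refCount.decrementAndGet();
    }

    // Like maybeRefresh(): publish a new snapshot; old ones stay valid for holders.
    public synchronized void refresh() {
        release(current); // drop the manager's own reference to the old snapshot
        current = new Snapshot(current.generation + 1);
    }
}
```

A thread that acquired a snapshot before a refresh keeps searching its old generation safely, which is exactly why searches and commits don't interfere.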

Searching documents

So far this is not very impressive, I know. Let's go ahead and search some stuff!
@Test
void search() throws IOException {
    searchGateway.add(new ForumSearchEntry(1, "Hello World"));
    searchGateway.add(new ForumSearchEntry(2, "We have a new tower"));
    searchGateway.commit();

    assertThat(searchGateway.getCount()).isEqualTo(2);
    assertSearchResult("hello", new ForumSearchEntry(1, "<b>Hello</b> World"));
}

private void assertSearchResult(String query, ForumSearchEntry... expected) {
    ForumSearchEntry[] results = searchGateway.search(query, 10);
    assertThat(results.length).isEqualTo(expected.length);
    for (int i = 0; i < results.length; ++i) {
        assertThat(results[i].postId).isEqualTo(expected[i].postId);
        assertThat(results[i].content).isEqualTo(expected[i].content);
    }
}
And here is the search method:
private static final ForumSearchEntry[] NO_RESULT = {};

private final Formatter formatter = new SimpleHTMLFormatter("<b>", "</b>");

public ForumSearchEntry[] search(String query, int limit) {
    try {
        IndexSearcher indexSearcher = searcherManager.acquire();
        try {
            SimpleQueryParser queryParser = new SimpleQueryParser(analyzer, "content");

            Query luceneQuery = queryParser.parse(query);
            TopDocs topDocs = indexSearcher.search(luceneQuery, limit);

            int count = topDocs.scoreDocs.length;
            if (count == 0) {
                return NO_RESULT;
            }

            QueryScorer scorer = new QueryScorer(luceneQuery);

            Highlighter highlighter = new Highlighter(formatter, scorer);
            highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer));

            ForumSearchEntry[] result = new ForumSearchEntry[count];
            for (int i = 0; i < count; ++i) {
                int documentId = topDocs.scoreDocs[i].doc;
                Document document = indexSearcher.doc(documentId);

                long postId = document.getField("postId").numericValue().longValue();
                String content = document.get("content");

                TokenStream stream = TokenSources.getAnyTokenStream(indexSearcher.getIndexReader(), documentId, "content", analyzer);

                try {
                    content = highlighter.getBestFragments(stream, content, 5, "...");
                } catch (InvalidTokenOffsetsException e) {
                    logger.warn("Failed to highlight content fragments, will use entire content");
                }

                result[i] = new ForumSearchEntry(postId, content);
            }

            return result;
        } finally {
            searcherManager.release(indexSearcher);
        }
    } catch (IOException e) {
        logger.error("Failed to search lucene index. Query: " + query, e);
        return NO_RESULT;
    }
}
Let's go through the method step by step. First we acquire an IndexSearcher the same way we did before; just make sure to release it properly in the finally block! Since we're dealing with a search string entered by a human being, we use the SimpleQueryParser to parse it. We can then perform the search via the IndexSearcher. If no documents are found, we return right there with an empty result. Otherwise, we initialize a Highlighter to mark the relevant sections in the content. For every document found, we obtain a TokenStream for the content and pass it to the highlighter, which marks the relevant sections; in our case they are simply wrapped in bold tags. Then we turn the document back into our ForumSearchEntry POJO. If all went well, our test is now green!
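To make the highlighting step a bit more concrete: the Highlighter produces the stored content with matched terms wrapped in the formatter's tags. A naive single-term version in plain Java could look like this (Lucene's real Highlighter works on token offsets and picks the best-scoring fragments, so this is only an illustration of the output shape):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive illustration of highlighting: wrap every case-insensitive, whole-word
// occurrence of a term in <b> tags. Lucene's Highlighter is far smarter
// (token offsets, fragmenting, scoring); this only shows the output shape.
public class HighlightSketch {

    public static String highlight(String content, String term) {
        // \b ensures whole-word matches, similar to a term-level match.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        return matcher.replaceAll("<b>$0</b>");
    }
}
```

This matches the expectation in the test above: searching for "hello" turns "Hello World" into "&lt;b&gt;Hello&lt;/b&gt; World".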

Updating documents

What happens if a post is edited? Right now, if we simply add a new document, we end up with two documents; the old one is not removed. Let's write a test to see if this is true:
@Test
void update() throws IOException {
    searchGateway.add(new ForumSearchEntry(1, "Hello World"));
    searchGateway.commit();

    searchGateway.add(new ForumSearchEntry(1, "Hello World2"));
    searchGateway.commit();

    assertSearchResult("world2", new ForumSearchEntry(1, "Hello <b>World2</b>"));
    assertThat(searchGateway.getCount()).isEqualTo(1);
}
The problem right now is that Lucene does not know that postId is our primary key. So let's rename the add() method to update(), since that's what the application will use. Here is the update method implementation:
private static final FieldType ID_TYPE;

static {
    ID_TYPE = new FieldType();
    ID_TYPE.setIndexOptions(IndexOptions.DOCS);
    ID_TYPE.setTokenized(true);
    ID_TYPE.freeze();
}

public void update(ForumSearchEntry entry) throws IOException {
    indexWriter.updateDocument(byId(entry.postId), createDocument(entry));
}

private Term byId(long postId) {
    return new Term("postIdLookup", longToBytesRef(postId));
}

private Document createDocument(ForumSearchEntry entry) {
    Document document = new Document();

    BytesRef idBytes = longToBytesRef(entry.postId);

    document.add(new Field("postIdLookup", new BinaryTokenStream(idBytes), ID_TYPE));
    document.add(new TextField("content", entry.content, Store.YES));
    document.add(new StoredField("postId", idBytes));

    return document;
}

private BytesRef longToBytesRef(long value) {
    BytesRef bytesRef = new BytesRef(8);
    bytesRef.length = 8;
    NumericUtils.longToSortableBytes(value, bytesRef.bytes, 0);
    return bytesRef;
}

private long bytesRefToLong(BytesRef bytesRef) {
    return NumericUtils.sortableBytesToLong(bytesRef.bytes, 0);
}
Next to the stored postId field we need another one that can be indexed. For that we add the ID (a long) as binary, tokenized data to the index. Simply converting the long to its raw binary form is not very efficient for the way Lucene works; fortunately there is a helper class to deal with that: NumericUtils. We also need to define a new FieldType for this to work. Finally, we can pass a term to the updateDocument() method, which ensures that each postId exists only once in the index! Note that the stored postId is now a BytesRef as well, so the search method has to read it back with bytesRefToLong(document.getBinaryValue("postId")) instead of numericValue().
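If you're curious what NumericUtils.longToSortableBytes actually does: to my understanding it flips the sign bit and writes the value in big-endian order, so that an unsigned byte-wise comparison of the encodings matches the numeric order of the longs. A simplified plain-Java re-implementation for illustration (not Lucene's class):

```java
// Sketch of a sortable long encoding: flip the sign bit, then write the
// value big-endian. Unsigned lexicographic order of the bytes then matches
// the numeric order of the longs (negative values sort before positive ones).
public class SortableBytesSketch {

    public static byte[] encode(long value) {
        long sortable = value ^ 0x8000000000000000L; // flip sign bit
        byte[] bytes = new byte[8];
        for (int i = 0; i < 8; i++) {
            bytes[i] = (byte) (sortable >>> (56 - 8 * i)); // big-endian
        }
        return bytes;
    }

    // Unsigned lexicographic comparison, the order in which Lucene compares terms.
    public static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return 0;
    }
}
```

The sign-bit flip is the trick: without it, negative longs (whose raw bytes start with 0xFF...) would sort after positive ones.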

Using the gateway

That's it, we've written a basic search using Apache Lucene. I'd recommend creating an interface like this to use it. That way Lucene is nicely separated from the rest of the application:
public interface ForumSearchGateway {
    void update(ForumSearchEntry entry) throws IOException;
    void delete(long postId) throws IOException;
    void commit() throws IOException;
    ForumSearchEntry[] search(String query, int limit);
    int getCount();
}
In my case, whenever a post is created, edited or deleted, I call the appropriate method on the gateway. On application startup I compare the number of posts in the database with the number of documents in the index and decide whether a re-index on a background thread is needed. When the user searches, I call search(). Your application might have other needs, but you get the idea.
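To illustrate that startup check, here is a sketch with hypothetical stand-ins (PostRepository and SearchIndex are made up for this example; only the count comparison matters):

```java
// Sketch of the startup decision described above: if the database and the
// search index disagree about the number of posts, schedule a full re-index
// on a background thread. Both interfaces are hypothetical stand-ins.
public class StartupCheckSketch {

    public interface PostRepository {
        long countPosts();
    }

    public interface SearchIndex {
        int getCount();
    }

    // Returns true if a background re-index should be scheduled.
    public static boolean needsReindex(PostRepository repository, SearchIndex index) {
        return repository.countPosts() != index.getCount();
    }
}
```

This keeps the decision cheap: no documents are read at startup, just two counts.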

Final thoughts

Apache Lucene is a powerful search engine! For a simple search like the one I wanted, it is a perfect fit. You don't necessarily need the complexity of administering another server instance running Solr or Elasticsearch. The search on this site completes queries in about 2 ms to 50 ms, without touching the database. Disclaimer: I'm no Lucene expert; just because this works on my little website doesn't mean it is the perfect way to go for all scenarios. Also, if you are a Lucene expert and you notice me doing something weird, I'm happy about any feedback. The comments section in this forum is for players of the game only, but you can drop me an email anytime at andy [at] mazebert.com.

very nice blog post - I have been tinkering a bit with Lucene too - but I didn't get as far as you. So thanks for your shared experience :-)

Thanks @HerbertBert!