[For our German-speaking readers: Because of our many international customers, this blog will often feature entries written in English…]
Those who know our products are aware that we’ve partnered with Oracle for a very long time. While in the beginning there was DC3 with its proprietary but very fast and efficient fulltext search engine, our DC4 and DC5 product lines relied heavily on Oracle Text (with the exception of a few PostgreSQL-based installations, which are integrated with another homegrown search engine we internally call FTX). Some of our installations are probably among the largest Oracle Text setups worldwide.
Although Oracle Text has its benefits (tight database-level integration, extensive query syntax, thesaurus support, multiple parallel indexes, partitioning, good internationalization), there are lots of drawbacks as well:
- Unstable query performance: In some installations, a few simple search terms would take up to fifty times longer than others, which caused us a lot of headaches and loss of reputation. That was an extreme case, but in general the performance of Oracle Text varied way too much. (And we invested a lot of time in finding ways to optimize it.)
- Oracle Text is hard to scale: Since it’s integrated into the database, you have to scale that as a whole (via RAC), which is expensive and inflexible. There’s no way to split a large fulltext index across lots of servers.
- It is missing support for total document count and faceted search; if you’d like to fetch the first ten matching documents only, you have to run a second search to get the total count.
- Oracle support often wasn’t very helpful; Text doesn’t seem to be a high priority product for them and it’s hard to find someone at Oracle who knows Text very well.
- While the database integration is a nice feature on one hand, on the other hand it makes it hard to customize what’s going into the fulltext index, and the fulltext index synchronization can slow down batch jobs.
- Today, every customer wants a Google-like query syntax, which isn’t provided by Oracle.
- It’s obviously bound to an expensive Oracle license, so customers with a lower budget running PostgreSQL (or MySQL) need a different search engine, and it’s a pain having to support multiple search engines.
Over the last few years, we evaluated a couple of alternatives. FAST search looked great, but it was quite expensive, and we weren’t shown a solution that would update the index in near real time (our customers cannot wait 10 minutes for the latest news to appear). Most open source search engines we tested seemed to have issues with Unicode, with millions of documents, or with near-real-time index updates.
But then we started to hear more and more about Lucene (thanks to everyone who pointed us in this direction!), and when we found out that there was an HTTP/XML-based interface for it in the form of Solr (which made it easier to call from our PHP programming environment), we took our time to test it thoroughly and were impressed. It handled our testbed of 8 million (relatively large) documents well on a single server, index updates seemed sufficiently fast, and query performance was stable. Plus we got a Google-like search syntax, faceted queries and the total document count for free.
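To illustrate that last point, here is a minimal sketch of how a single Solr request can return the first page of hits, the total hit count and facet counts at once. It is written in Python rather than our PHP environment, and it is not our actual integration code: the endpoint URL, the field name `category`, the helper names and the sample response are all made up for illustration. It also asks for Solr’s JSON response format (`wt=json`) instead of the XML one mentioned above, simply because that is shorter to parse here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and path are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"


def build_search_url(query, facet_field, start=0, rows=10):
    """Build one Solr select URL that asks for a page of hits plus facet counts."""
    params = [
        ("q", query),                  # user query
        ("start", start),              # offset of the first hit to return
        ("rows", rows),                # page size, e.g. the first ten hits
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # field to count values for
        ("wt", "json"),                # assumption: JSON instead of XML output
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


def summarize_response(body):
    """Extract the total hit count, the returned page and the facet counts."""
    data = json.loads(body)
    response = data["response"]
    facets = {}
    facet_fields = data.get("facet_counts", {}).get("facet_fields", {})
    for field, flat in facet_fields.items():
        # By default Solr's JSON output lists facets as [value, count, value, count, ...].
        facets[field] = dict(zip(flat[0::2], flat[1::2]))
    return response["numFound"], response["docs"], facets


# Made-up response body in the shape Solr returns; the numbers are not real data.
SAMPLE_BODY = json.dumps({
    "response": {
        "numFound": 1234,
        "start": 0,
        "docs": [{"id": "doc-1"}, {"id": "doc-2"}],
    },
    "facet_counts": {
        "facet_fields": {"category": ["news", 900, "sports", 334]},
    },
})

if __name__ == "__main__":
    print(build_search_url("lucene solr", "category"))
    total, docs, facets = summarize_response(SAMPLE_BODY)
    print(total, len(docs), facets)
```

The total comes from `numFound` in the same response that carries the page of documents, so no second search is needed to show the overall hit count next to the first ten results.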
Solr and Lucene are open source, actively developed and widely used. They should be easier to scale than Oracle Text, and we have more freedom to customize index contents since the index is decoupled from the database. So we finally decided to make the switch away from Oracle Text to Solr, starting with our new DC-X product. (This switch is just about the search engine; the data itself is still held in either Oracle or MySQL.) It is not in production use yet, but it runs stably and very fast on our demo and development servers. The first installations at customer sites will start soon.
We’re excited about this move, and while I’m sure there will be some quirks which we’ll have to get familiar with, it opens up quite a few interesting possibilities!