Tuesday, June 5, 2007

Benchmarking results of mysql, lucene and sphinx...

Finally I was able to benchmark the three search engines:
- MySQL fulltext search
- Lucene search engine
- Sphinx (www.sphinxsearch.com)

I came across Sphinx while searching for full text search engines. It is a very good engine. A few points regarding Sphinx:
-> Very simple to configure, create an index and search.
-> Very easy to integrate with PHP and MySQL; APIs for both are available. I was able to build an index and search it using Sphinx in a few hours.
-> The index created is a combination of all the MySQL fields; there is no distinction between fields during search. So you can search an index as a whole, but not individual fields of the index.
-> Of course, since the source code is available, the search process can be customized to your needs. Moreover, version 0.9.6, which is under development, will provide field-wise search.
-> Since it is written in C++, it is expected to be faster than Lucene (which is in Java).

I did the benchmarking on my own laptop, a Dell Inspiron 700m running Linux (Fedora Core 4, kernel 2.6.11). Machine configuration:
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB

I took a table containing 1 lakh (100,000) records; the data size was 456 MB. I then created an index on some fields of the table.

INDEXING TIME

Mysql Version - 5.1.9 (Beta)
Stop words : Built in
Indexing words of length >= 2 and <= 84. (MySQL has a feature that lets you specify the minimum and maximum length of words to index. By default the minimum length is 4; I changed it to 2 so that I could index words like net, php, hr etc. If you want to index all words, change this to 1.)
Indexing time : 1440 seconds
Index size : 263 MB (the 456 MB of data remains unchanged).

Lucene Version - 1.9.1
Stop words : 660 general words (like is, if, this etc...)
Indexing time : 2761 seconds (the default configuration was used during indexing. There are parameters such as mergeFactor and maxMergeDocs with which indexing can be tuned to run much faster, though that may result in a "Too many open files" error on Linux.)
Index Size : 109 MB (No data is stored. I stored only the unique id of each document, using which I can retrieve the document later.)

Sphinx Version - 0.9.5
Stop words : NONE
Indexing time : 246 seconds (using the default configuration. I don't know yet whether indexing can be tuned.)
Index Size : 211.1 MB (3.1 MB Index - .spi file + 208 MB Data - .spd file).
3.1 MB of index looks extremely good. Also, with Sphinx there is no need to maintain a separate id for retrieval of data, since the unique id of your data is kept in the index, whereas with Lucene you have to maintain a separate id and enforce its uniqueness in your program. The indexing time and data compression are both good.

SEARCH TIME

The searches were done using scripts. I ran a number of searches on randomly selected words and then computed an average time. For Lucene and MySQL, the search was done on 2 fields with an OR between them; for Sphinx, the search was done on the complete index.
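The exact benchmark scripts are not shown here, but this is a rough sketch of the two query shapes just described. The table, column and field names (docs, title, body) and the connection details are made up for illustration, and the Lucene part assumes the 1.9-era Java API:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class TwoFieldSearch {
    public static void main(String[] args) throws Exception {
        String word = "php";

        // MySQL: OR across two fulltext-indexed columns.
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        PreparedStatement ps = con.prepareStatement(
                "SELECT id FROM docs WHERE MATCH(title) AGAINST(?) OR MATCH(body) AGAINST(?)");
        ps.setString(1, word);
        ps.setString(2, word);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println("mysql hit: " + rs.getLong("id"));
        }
        con.close();

        // Lucene: the same OR over two fields, expressed in the query string.
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        QueryParser parser = new QueryParser("title", new StandardAnalyzer());
        Query q = parser.parse("title:" + word + " OR body:" + word);
        Hits hits = searcher.search(q);
        System.out.println("lucene hits: " + hits.length());
        searcher.close();
    }
}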

Searches/Thread | Concurrency (no. of simultaneous threads) | Total searches | Total time (ms) | Average time (ms)
MySQL
5  | 1 | 5   | 347153  | 69430.6
5  | 2 | 10  | 727359  | 72735.9
10 | 3 | 30  | 2288392 | 76279.73
I found that searching for an exact phrase, which can be done using "IN BOOLEAN MODE" queries, is even more resource hungry. The query time in MySQL is extremely high. MySQL serves fulltext queries largely from RAM, so with more RAM and an accordingly configured MySQL the queries would be much faster. Concurrency does not affect query execution speed to a major extent.
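For reference, a minimal sketch of such an exact-phrase query in boolean mode (the table and column names are made up, and a FULLTEXT index covering both columns is assumed):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhraseSearch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();
        // The double quotes inside AGAINST() ask MySQL for the exact phrase.
        ResultSet rs = st.executeQuery(
                "SELECT id FROM docs " +
                "WHERE MATCH(title, body) AGAINST('\"java developer\"' IN BOOLEAN MODE)");
        while (rs.next()) {
            System.out.println("hit: " + rs.getLong("id"));
        }
        con.close();
    }
}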
LUCENE
5  | 1 | 5   | 673  | 134.6
5  | 2 | 10  | 615  | 61.5
10 | 3 | 30  | 897  | 29.9
50 | 3 | 150 | 2762 | 18.41
Initially searches are slow, but as we keep searching, the index gets cached in RAM and the speed of searches increases. The searches are very fast compared to MySQL. From these results it does not seem that concurrency is an issue, but I have been using Lucene for some time now and have found that it does have some concurrency issues. If you are searching a huge index of, say, 1 crore (10,000,000) records with an index size of 8-10 GB, then with a concurrency of 100 simultaneous searches issues pop up, as searches seem to get locked. Still, performance-wise it is much better than MySQL.
SPHINX
5  | 1 | 5   | 512  | 102.4
5  | 2 | 10  | 733  | 73.3
10 | 3 | 30  | 2272 | 75.73
50 | 3 | 150 | 4439 | 29.59
Single searches are faster than in Lucene, but we have to consider that there is no OR clause in the search, so the engine does not have to fetch 2 different result sets and take a union of them. As the concurrency of searches is increased, the average time per search does not drop as sharply as in Lucene, which suggests that there may be concurrency issues here. Since I have not explored this in depth, I cannot comment further on the concurrency problems.


To sum up, Lucene seems to be the right choice for now if you are looking to search large amounts of data and performance is your primary goal; its list of available features is also impressive. Sphinx comes in next: indexing time is very small and indexing/searching are hassle free. As it evolves, Sphinx may overtake Lucene some time down the line, providing both good features and performance. MySQL fulltext search comes last; it seems it should be used only if the data set is small and quick development time is required.

6 comments:

Roland Bouman said...

Hey,

thank you!

One of the most important conclusions seems to be that MySQL fulltext search is not the right choice if you really need to rely on fast fulltext searching, but I keep wondering about the memory situation. At the very least, your article tells me that I should really care about memory when I'm planning on MySQL fulltext search, but I'm really curious what would happen if the machine had loads and loads of memory. Did you have any indication that memory usage was in fact the limiting factor for MySQL fulltext search on your system?

kind regards,

Roland

Peter Zaitsev said...

Hi,

I would be careful with benchmarks. First, you really should be using the same stop word list and configuration, as it can affect results dramatically.

I would also consider the different matching strategies the search engines use. MySQL and Lucene use probabilistic matching by default, while Sphinx uses an algorithm that takes word position into account. This is much slower (could be 10 times) and requires much larger indexes, as you have to store word positions rather than simply how frequent a word is in the text. It sounds bad, but it is what is required for good search quality - for example, major search engines such as Google/Yahoo account for phrases.

Yet another important thing to consider is size of data. In your case data was close to memory size which normally makes load CPU bound. For large sizes it is important how efficiently disk IO is done.

Finally, there are different matching modes for Sphinx: you can use "MATCH_ANY" if you want "OR" matching mode. And as I mentioned, typically there is much more work involved in finding full phrases and close keywords.

Jayant Kumar said...

Before shifting our product to Lucene, I used to run it on the MySQL fulltext search engine. The servers that we use have loads of RAM and things used to run very well on them. But as soon as the data+index size (.MYI+.MYD) of the MySQL table exceeded the physical RAM available, I started running into performance issues.

So it would be safe to say that if you have RAM > (.MYI + .MYD) and have configured MySQL accordingly, MySQL fulltext search will work well.
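A quick way to check that condition, as a sketch (the table name "docs" is made up): SHOW TABLE STATUS reports the .MYD size as Data_length and the .MYI size as Index_length for a MyISAM table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TableSize {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SHOW TABLE STATUS LIKE 'docs'");
        if (rs.next()) {
            long bytes = rs.getLong("Data_length") + rs.getLong("Index_length");
            System.out.println("data + index = " + bytes / (1024 * 1024) + " MB");
        }
        con.close();
    }
}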

Jayant Kumar said...

Hello Andrew,

I know that I may not have done a fair comparison for Sphinx. Since I have been using Lucene for about 2 years now, I know the ins and outs of it, whereas I have just started exploring Sphinx.

Pls find more comments in the mail itself.

--Jayant

On 6/3/06, Andrew Aksyonoff shodan@shodan.ru wrote:
> Hello Jayant,
>
> Thursday, June 1, 2006, 4:13:40 PM, you wrote:
> JK> I would like to try out the 0.9.6. Pls send me the cvs snapshot.
>
> I accidentally came across your blog entry comparing Sphinx
> and Lucene, and would like to add a few comments, as well as ask
> a few questions.
>
> First, comparing Lucene index which is using stopwords against
> Sphinx index which is NOT using stopwords is, well, a little unfair ;)

I agree. Including stop words will make a huge difference. I will do that when I run Sphinx on a large dataset.

>
> Second, there is an OR clause in Sphinx already. See SetMatchMode()
> API call and pass SPH_MATCH_ANY there. Also, not having to merge the
> results in SPH_MATCH_ALL mode is in fact of minor importance; the
> slower part is not merging the result sets or intersecting them.

I had passed SPH_MATCH_PHRASE. By the OR clause I meant an "OR" between two fields. Since you mention it, I could have done an OR between two words in the index; the results would then have been more directly comparable.

>
> Third, the concurrency issues might very well be mistaken for
> caching issues. There's a known deficiency in Sphinx: it doesn't cache
> anything for now; it just depends on OS for file caching, and always
> does all the CPU work for each query. So if your test queries are the
> same, or use the same words, then it's highly probable that Lucene
> caches some intermediate results, while Sphinx does not. This is
> obviously a problem in Sphinx, and I'm of course going to fix that,
> so I'm really interested if the slowdown you experienced is a real
> concurrency issue or caching issue. Should probably finally do my
> own tests, though!

As far as I know, Lucene also does not do any query/result caching. There is a .tis file which, as far as I remember, is loaded into memory, and this can improve performance. Please correct me if I am wrong.

>
> Now for the questions... :)
>
> I wonder if you are going to test the engine on some *really* large
> dataset, ie. 10-20 GB of text or something. With 500 M dataset, and
> 100-200 M index, it could just fit in memory, which is not the real
> world scenario with huge datasets.

Yes, I will certainly do that, but it will take some time.

>
> I also wonder what was the ranking mode used with Lucene and how
> would you judge the relevance. If I'm not mistaken about Lucene
> defaults, Sphinx typically does more processing per query, because it
> always processes word positions, trying to rank matching phrases
> higher - relevance actually was my primary concern when initially
> developing Sphinx.

Lucene also does some relevance ranking, though I have not gone into the depth of how it does that.

>
> Finally, I'd like to thank you for your benchmarking work. Your
> concurrency test was really interesting and really gives some food for
> thought. And... please do send me your results next time you
> benchmark! :)

Thanks :) . Will certainly do that...

>
> --
> Best regards,
> Andrew mailto:shodan@shodan.ru
>
>

Gaurav said...

Hi Jayant,

Good Effort! But I would also like to have a look at your scripts, which you used for benchmarking. Can you please provide your scripts also?

Gaurav

gnolkha AT gmail DOT com

From http://jayant7k.blogspot.com/2006/06/benchmarking-results-of-mysql-lucene.html

Mysql versus Lucene

Here is a comparison between the MySQL fulltext and Lucene search engines. On the surface, the main things that distinguish one from the other are:

==> fulltext search in Lucene is much faster than in MySQL
==> Lucene is much more complex to use than MySQL

In MySQL, you can simply mark an index on a text/varchar column as fulltext, and your work is done; all you need to do next is fire MATCH ... AGAINST queries. Addition and modification of the index is handled by MySQL internally as new data is added. In Lucene, on the other hand, the addition and modification of documents has to be handled programmatically. Moreover, MySQL plugs into your application through readily available APIs: MySQL runs on a port, and all you need to do is connect to that port and fire SQL queries at it from your application. With Lucene, you have to plug your application into the index using the Lucene API.
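A rough sketch of that difference (the table, column and field names are made up; the Lucene part assumes the 1.9-era API):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BuildIndexes {
    public static void main(String[] args) throws Exception {
        // MySQL: one DDL statement on a MyISAM table, after which
        // MATCH ... AGAINST works and the index is maintained automatically.
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();
        st.execute("ALTER TABLE docs ADD FULLTEXT ft_title_body (title, body)");
        con.close();

        // Lucene: every document has to be added (and later updated) by your own code.
        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "some title", Field.Store.NO, Field.Index.TOKENIZED));
        doc.add(new Field("body", "the full text of the document", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}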

Another difference is that Lucene is very efficient at searching a large number of documents, whereas with MySQL the speed of search goes down as the number of documents increases. MySQL uses RAM to cache the index and serves queries from that cache, so if the size of your fulltext index exceeds the RAM, you will see a major fall in search performance. With Lucene, the size of the index affects search performance far less, even when it exceeds the RAM on the system.

With MySQL, when a fulltext index is created on a table, inserts on the table become very slow. Let's analyze this. For each record, MySQL does significant processing to place the record in the current index. After the record is indexed, the cached copy of the index in RAM has to be refreshed, since the previously cached index no longer matches - it does not have the new record. So with each record MySQL effectively re-fetches the complete index into the cache/RAM. If you are performing searches and inserts/updates on a single table with a fulltext index, the performance of both search and indexing degrades badly. With Lucene, on the other hand, the addition of new documents is not a major drawback: documents can be added on the fly, which makes indexing very fast, and this process does not affect searching. Two things to be noted here (a sketch of the delete-and-re-add update pattern follows these two points):

==> Lucene does not allow you to modify a document in place. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.

==> Lucene requires an index object (an IndexSearcher/IndexReader) to perform a search; you will see this when you use the API. Whenever you add a new document to the index, a new index object has to be created for searches to include the new document. Creating a new object is not a major overhead, though it does slow down the searching process to some extent.
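A minimal sketch of the delete-and-re-add pattern from the first point (1.9-era API; the field names are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateDocument {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/index";

        // 1. Delete the old version of the document by its unique id field.
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocuments(new Term("id", "42"));
        reader.close();

        // 2. Add the modified version back (create=false appends to the existing index).
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("body", "the updated text", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // 3. Searchers opened after this point see the change; an already open
        //    IndexSearcher has to be re-opened to pick it up.
    }
}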

With Lucene, you do not have the flexibility to join two indexes in a single query, the way you can in MySQL with something like this:

SELECT TABLE1.COLA, TABLE2.COLB FROM TABLE1, TABLE2 WHERE MATCH(TABLE1.COL1) AGAINST('TEXT1') AND MATCH(TABLE2.COL2) AGAINST('TEXT2') AND TABLE1.ID = TABLE2.FOREIGNKEY

(Please don't mind the exact syntax; look at the meaning/logic behind it. :-D) This cannot be done with Lucene. You have to arrange the data so that your index contains the data of both TABLE1 and TABLE2, and then shape your search to get the data you need. Too complex, right? A sketch of this denormalizing approach follows.
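One way to do it is to denormalize at indexing time, so that every Lucene document carries fields from both tables. A sketch (the table and column names follow the made-up SQL above; the Lucene part assumes the 1.9-era API):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class DenormalizedIndexer {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
                "SELECT TABLE1.ID, TABLE1.COL1, TABLE2.COL2 " +
                "FROM TABLE1, TABLE2 WHERE TABLE1.ID = TABLE2.FOREIGNKEY");

        IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        while (rs.next()) {
            // One Lucene document per joined row, holding data from both tables.
            Document doc = new Document();
            doc.add(new Field("id", rs.getString("ID"), Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("col1", rs.getString("COL1"), Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("col2", rs.getString("COL2"), Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        con.close();
    }
}

A query like col1:TEXT1 AND col2:TEXT2 then does the work of the joined MySQL query above.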

Also, MySQL comes with an inbuilt list of stopwords and a default word tokenizer, which separates words based on " ", ",", "." etc. With Lucene, both the list of stop words and the word tokenizer are defined by us. This is advantageous, because you can define your own stopwords and tokenize the text as per your requirements, as in the sketch below.
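A small sketch of defining your own stop words in Lucene (1.9-era API; the stop word list and field name are just examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CustomStopWords {
    public static void main(String[] args) throws Exception {
        // StandardAnalyzer does the tokenizing; this constructor takes your own stop list.
        String[] myStopWords = { "is", "if", "this", "the", "a" };
        StandardAnalyzer analyzer = new StandardAnalyzer(myStopWords);

        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        Document doc = new Document();
        doc.add(new Field("body", "this is the text that gets tokenized",
                Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }
}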

With MySQL, searches are case-insensitive by default. With Lucene, you can make searches case-sensitive or case-insensitive, whichever way you want.

With MySQL, there is a minimum length for words to be indexed, 4 by default, so all words with fewer than 4 characters will not be indexed. What will you do if you want to index words like "php", "asp", "c"? You will have to decrease the minimum length from 4 to 1, and this will increase the index size drastically and slow down all your searches as a consequence. There are no such issues in Lucene.
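The change itself is a server setting plus an index rebuild. A sketch, with a made-up table name: ft_min_word_len goes into my.cnf (it cannot be changed at runtime), mysqld is restarted, and the fulltext index is rebuilt.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RebuildFulltext {
    public static void main(String[] args) throws Exception {
        // In /etc/my.cnf, under [mysqld]:
        //   ft_min_word_len = 2
        // then restart mysqld and rebuild the fulltext index:
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();
        st.execute("REPAIR TABLE docs QUICK"); // rebuilds the MyISAM indexes with the new setting
        con.close();
    }
}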

In MySQL, every word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight, and a rare word has a higher weight. So if a word is present in 50% or more of the rows in a table, a query searching for that word returns 0 results. MySQL calls this relevance, but for me it simply produced incorrect results for a query.
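The 50% cut-off applies to natural language mode only; a boolean mode query skips it. A sketch (the table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CommonWordSearch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "pass");
        Statement st = con.createStatement();

        // Natural language mode: returns nothing if 'mysql' occurs in >= 50% of the rows.
        ResultSet nl = st.executeQuery(
                "SELECT COUNT(*) FROM docs WHERE MATCH(body) AGAINST('mysql')");
        nl.next();
        long nlCount = nl.getLong(1);

        // Boolean mode: the 50% threshold is not applied.
        ResultSet bool = st.executeQuery(
                "SELECT COUNT(*) FROM docs WHERE MATCH(body) AGAINST('+mysql' IN BOOLEAN MODE)");
        bool.next();
        long boolCount = bool.getLong(1);

        System.out.println("natural language matches: " + nlCount);
        System.out.println("boolean mode matches    : " + boolCount);
        con.close();
    }
}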

This link http://dev.mysql.com/doc/connector/j/en/fulltext-search.html will give a better idea of mysql fulltext search.

In Lucene, there are some advanced options like:


  • proximity search - find documents where searchword1 and searchword2 occur near each other (for example, with one word between them)

  • wildcard search - find documents containing words that match searchword* or search?word, etc.

  • fuzzy/similarity search - find documents with words similar to roam~ (will match roam, foam, etc.)

  • term boosting - you can boost a term to move more relevant documents to the top. For example, if you want documents with the word "lucene" to be considered more relevant than those with the word "mysql", you can write something like -> lucene^4 mysql
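These map directly onto Lucene's query syntax. A rough sketch of parsing such queries (1.9-era API; the field name "body" is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class AdvancedQueries {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());

        Query proximity = parser.parse("\"searchword1 searchword2\"~2"); // words near each other
        Query wildcard  = parser.parse("searchword* OR search?word");    // wildcard search
        Query fuzzy     = parser.parse("roam~");                         // fuzzy/similarity search
        Query boosted   = parser.parse("lucene^4 mysql");                // term boosting

        System.out.println(proximity + "\n" + wildcard + "\n" + fuzzy + "\n" + boosted);
    }
}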


Sorting of results is extremely fast if you are using Lucene. With MySQL, if you expect your results to come out fast, you have to forget sorting; it takes a huge amount of time. Even retrieving the total number of matching documents in MySQL is a chore, whereas with Lucene the total number of matches comes out by default, as in the sketch below.
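A sketch of a sorted Lucene search (1.9-era API; the "date" field is made up and would have to be indexed un-tokenized for sorting to work):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class SortedSearch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query q = new QueryParser("body", new StandardAnalyzer()).parse("php");

        Hits hits = searcher.search(q, new Sort("date")); // results sorted by the date field
        System.out.println("total matches: " + hits.length()); // total count comes for free
        searcher.close();
    }
}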

In short, if you are looking for a fulltext index on a small table without much hassle, go for the MySQL fulltext index. If you are looking at searching a large collection of data, go for Lucene.

Whew... I wrote a lot, and there is more to write... maybe next time. Do let me know if I have missed anything.

8 comments:

Roland Bouman said...

Thanks for this very insightful comparison!

It was a pleasant read, and I think it gave me a grasp of the differences between these two methods of fulltext indexing.

I've got a question though: in a couple of places you mention performance issues - inserting new documents, the actual search and the sorting. Can you back this up with a few figures? I'd be very grateful if you did, maybe in a sequel article?

Anyway, thanks again so far.

noodl said...

Thanks Jayant. This will make a great explanation for that guy who asked on freenode#apache today about the differences. Let's hope he comes back again!

Jayant Kumar said...

I will be putting down some figures for the performance benchmarks soon...

pabloj said...

Has anyone considered placing CLucene (http://sourceforge.net/projects/clucene/) in a UDF, storing its index in a blob field, and using it to perform full text searches in MySQL? Or am I going down the wrong path?

Jayant Kumar said...

As far as I know, CLucene is just a replication of the java-lucene version in C. The way indexing and search work is the same for CLucene and java-lucene.

However, what I have read is that CLucene is about 3 times faster than java-lucene, since the compiled C code runs closer to the machine, without an intermediate virtual machine as in Java.

Anonymous said...

Good overall comparison of Lucene vs Mysql fulltext search!

CLucene is not in C. It's an implementation of Lucene in C++. The Apache foundation started an index-level compatible project in C for Lucene, codenamed lucene4c, but it hasn't got as far as CLucene.

Another great indexing toolkit, very much in the same ballpark as Lucene but with less fanfare, is Xapian (http://www.xapian.org). Its search time is very fast and its features are comparable to Lucene's. It's also very mature, as it used to be the core of a web search engine before being released as an open-source project.

Anonymous said...

Can Lucene index text content that resides inside MySQL MyISAM tables, or "just" files?

Jayant Kumar said...

Lucene can index any type of textual data, whether it resides in MySQL, in files, or in any other database.

You will have to write programs to fetch data from the data source, be it MySQL or DB2 or text files, and feed it into Lucene to build the index.