Sunday, April 12, 2009

Search and Cluster Search Results with Lucene, Solr and Carrot2

Background
These days I've been working on the "search part" of a website written in PHP. We want to put extra effort into this part to make the site stand out from similar ones. So the goal is to have a QUICK, SCALABLE, FAST-TO-GO search engine that produces an EASY-TO-DIG result. (btw, I am not satisfied with the ways PHP does search.)
QUICK: fetch search results quickly
SCALABLE: easy to extend the search functionality (pre-search, search, post-search, and performance too)
FAST-TO-GO: easy to implement, with a short learning curve
EASY-TO-DIG: from the users' perspective, the search results should be easy to use (e.g. by clustering them)

With these in mind, I prefer a separate application in Java to do it. Soon it turned out like this:
Lucene as the low-level index library. version 2.4.1
Solr as a kind of middleware to maintain the Lucene index (indexing, searching, updating, monitoring, analysis ...etc). version 1.3.0
Carrot2 as the part that clusters and renders the result (in this example, I customized the fancy UI that comes with Carrot2). version 3.0.1

The rest of this article is divided into 4 parts:
1. Basics of Lucene: a brief introduction to Lucene
2. Solr: what Solr is, why we use it, how to use it, how to generate index files from existing data in a database, and how to make your own field type (the support for Chinese characters)
3. Carrot2: what Carrot2 is, and how to get it working with a different source (in this case Solr)
4. Conclusions: what you can do differently with these three, and what more can be done.

here we goooooo ...

Lucene
(java)
Since Lucene is famous enough, I'll just give a basic introduction.
Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform, and provides indexing and search technology as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
So you can see it as a lower-level library.
Basically there are two things you have to care about when using Lucene: index and search.

Index
Like an index in a database: the more data you have, the bigger the difference between querying a field with and without an index. The same is true for a Lucene index! Indexing converts your data source into documents in a structure that is very suitable for search. Lucene maintains a table of keywords, and each keyword links to the bunch of articles that contain it, so it searches fast.

Search
Searching means giving some keywords and finding the related articles based on the index.

Core classes
Here are some core classes you need to know:
Document - represents a record in the index; contains a list of key-value pair fields.
Term - represents a word, Lucene's unit of indexing.
TermDocs - enumerates the documents that contain a given Term.
Directory - represents the location where the index is stored: file system, memory, db ...etc.
IndexReader - the low-level class for reading the index; IndexSearcher, built on top of it, is what you usually search with.
IndexWriter - the core class for indexing.
Hits - represents the search result.

Given these classes we can write a small index-and-search test. The index snippet looks like this (updated to the Lucene 2.4.1 API, since the old Field.Text() helpers are gone):

import java.io.FileReader;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

Analyzer sAnalyzer = new StandardAnalyzer();
// true means create a new index, overwriting any existing one
IndexWriter indexWriter = new IndexWriter("pathToIndexDir", sAnalyzer, true);
Document document = new Document();
Reader ri = new FileReader("pathToFileToBeIndexed");
document.add(new Field("name", "fileName", Field.Store.YES, Field.Index.ANALYZED));
document.add(new Field("contents", ri)); // Reader-based field: indexed but not stored
indexWriter.addDocument(document);
indexWriter.close();
and the search part:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory("pathToIndexDir"));
Term term = new Term("contents", "this_is_the_keyword_i_wanna_search_for");
TermQuery q = new TermQuery(term);
Hits hits = searcher.search(q);
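Hits is a lazy list of matching documents; to dump the results, something along these lines will do (hits.doc(i) fetches the stored fields of the i-th match):

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(hits.score(i) + " : " + doc.get("name"));
}
searcher.close();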
That was a short talk about Lucene; if you want to know more, check the resources at the end of this article.

Now you have the weapon to fight with. Let's move on.

SOLR
The Apache site says: Solr is an open source enterprise search server based on the Lucene search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Jetty or Tomcat.

Since Solr is built natively on Lucene, we don't have to worry about integrating Lucene ourselves; Solr handles all the communication with Lucene.

What we need Solr for
From a developer's view you can think of Solr either as a search server or as a framework. With Solr you can create, update and delete index entries, search with sophisticated queries, render the search result in XML or JSON format, do analysis, and even import and extract different data sources into the index (generate the index from a database or web pages, for instance) ...etc.
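All of this goes over plain HTTP. To give a taste of it, here is a quick sketch (my own toy example, not part of the setup below) that posts one document to Solr's XML update handler at /solr/update; the id and subject fields match the schema we will define later:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrAddExample {
    public static void main(String[] args) throws Exception {
        String doc = "<add><doc>"
                + "<field name=\"id\">1</field>"
                + "<field name=\"subject\">hello solr</field>"
                + "</doc></add>";
        // commit=true makes the document visible to searches right away
        URL url = new URL("http://localhost:8080/solr/update?commit=true");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true); // turns the request into a POST
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(doc.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + con.getResponseCode()); // 200 means Solr accepted it
    }
}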

I think these features will attract you, absolutely! Is it complicated to set up? Nope, almost everything can be done through configuration. Cool.

Install and run
I am not going to describe it step by step like other articles do; for the detailed steps you can check the Solr wiki. I'll just summarize my experience and share it here.

In this article we use Solr 1.3.0. Grab Solr from its homepage and unzip it to "~/apache-solr-1.3.0". Find the "dist" folder and copy "apache-solr-1.3.0.war" from it to your servlet container, in this case Tomcat 6. So I copy it to the tomcat_home/webapps folder and rename it to solr.war. Since this is a war file, the application gets deployed when we start Tomcat and becomes accessible at localhost:8080/solr. If Tomcat has its default configuration, the war file is unpacked into a folder named after the war file. This application is the Solr search server. In most cases we don't need to change it; all we need to do is start/stop Tomcat :). Go to your tomcat_home and run "./bin/startup.sh". After Tomcat starts up, point your browser to http://localhost:8080/solr/admin/ and it should show up as follows:

This is the admin tool for Solr; we will learn more about it later. Actually it is self-explanatory enough.

Before moving ahead, check your tomcat_home: do you find a newly created folder called "solr"? (If you ran the Tomcat script from tomcat_home.) This folder is called the Solr home, where Solr keeps the index files and all the other configuration files.

Config the Solr
To configure Solr there is one folder and two files to care about: the Solr home, schema.xml and solrconfig.xml.

solr home

We have installed the server (application), but how do we customize it? After all, everyone wants different behavior from Solr.

The point is that Solr needs a Solr home to run; the Solr home folder contains all the customized files. Basically you should copy the "solr" folder from "apache-solr-1.3.0/example/" to wherever you prefer to work, as the Solr home (e.g. ~/solr_home). This is the standard, clean Solr home you can reuse for your projects.

Point to solr home
How do you let your Solr app know where the Solr home is? There are 3 ways to go:
1. Set the Java system property solr.solr.home to your Solr home (e.g. add -Dsolr.solr.home=/home/jacky/solr_home to JAVA_OPTS).
2. Configure the servlet container so that a JNDI lookup of "java:comp/env/solr/home" by the Solr webapp points to the Solr home.
3. The default Solr home is "solr" under the JVM's current working directory ($CWD/solr), so start the servlet container in the directory containing ./solr.

The third way is why we got a solr folder under tomcat_home, as mentioned above. In our case we use the second one: edit your context file in Tomcat (e.g. tomcat_home/conf/Catalina/localhost/solr.xml).
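Something like this should do (adjust the paths to your own layout; the Environment entry is what the JNDI lookup finds):

<Context docBase="/path/to/tomcat_home/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/jacky/solr_home" override="true"/>
</Context>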
Now stop Tomcat, delete the solr folder from tomcat_home and restart Tomcat; you will find Solr uses ~/solr_home as its home. Open http://localhost:8080/solr/admin/ and it will tell you the home at the top of the page.

Now that you have your own solr_home, explore it; you will find the following structure:
bin -- contains a lot of command line scripts
conf -- this is the one we need; it contains the configuration files schema.xml and solrconfig.xml

schema.xml
Just like in a database, each row in a table has many columns, and each column has its own type: varchar, int, blob ...etc. From the Lucene part we know each document in the index contains a lot of fields, and each field has a type too. Which field types Solr supports, which fields (columns) a document (table) has, which field is of which type ...etc is all defined in this schema.xml, just like a database schema.

This xml has very nice comments; after reading it you will have an idea of what each tag means. The most important ones:
<schema> -- the root tag.
<types> -- the first part of the file declares all the field types Solr supports.

In the text_ws type, ws stands for whitespace: a text field that only splits on whitespace, for exact matching of words.
<fields> -- the second part of the file declares which fields (columns) a document (table) has; in other words, what information you want Solr/Lucene to keep and what you want Solr/Lucene to do with it.

<uniqueKey> -- the field used to determine and enforce document uniqueness, like the primary key in a database.

Okay, let's take a concrete example. We have a blog application and we want Solr to work as its search engine. We will use this example throughout this article.

First we need to let Solr know the schema, in other words let Solr understand the data structure. Solr comes with a lot of field types, so we don't need to make a new one (later I will show you how to define your own type, e.g. to deal with Chinese characters); we just declare the fields (no need to make your own data type; it's just like declaring the columns of a table).
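The declarations look roughly like this (the field names are the ones used in the rest of the article; the exact types are my best guess):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="link" type="string" indexed="false" stored="true"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="true"/>
<field name="createTime" type="date" indexed="true" stored="true"/>
<field name="tags" type="text_ws" indexed="true" stored="true" multiValued="true"/>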
name -- mandatory: the name of the field; we will use it later in another config file.
type -- mandatory: the name of a previously declared type from the <types> section.
indexed -- true if this field should be indexed (searchable or sortable).
stored -- true if this field should be retrievable.
multiValued -- true if this field may contain multiple values per document.
So the above says Solr/Lucene will keep the information of id, link, subject, content, author, createTime and tags. Some of them need to be indexed; some of them need to be stored, otherwise you won't get the information back in the search result. Notice that tags has multiValued="true": it may have more than one value per document. With these defined we can say: I want to search subject for the word "jacky", or I want to search content for the word "jacky". Solr also provides an easy way to gather the information of several fields into one field; all we need is a special field:
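It is a catch-all field, fed by copyField rules; something like this:

<field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="subject" dest="all_text"/>
<copyField source="content" dest="all_text"/>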
With the new field "all_text" we can search "subject" and "content" for the word "jacky" in one shot, since Solr/Lucene puts these two pieces of information together while indexing. At the end of schema.xml:
<uniqueKey>id</uniqueKey>
<defaultSearchField>all_text</defaultSearchField>
This says: the primary key field is "id", and when Solr searches it uses "all_text" as the target field if none is specified. Up to now we have told Solr how to deal with the information; next we tell Solr how to run.

solrconfig.xml
This config file is about the Solr runtime itself, not the index part. Like schema.xml, this file also has nice comments. Instead of copy-pasting them here, I'll just point out the parts we need in this case:
<dataDir>${solr.data.dir:/home/jacky/solr_home/data}</dataDir>
This tells Solr where to store the index files; it can be outside the Solr home. One more part we will use is the one for importing data, which we will talk about soon in the Extension part.

Now we have all the config done; we could actually start Tomcat and give it a search. The problem is that we haven't indexed any articles yet. The next section (Data import) takes care of that.

Before continuing, keep in mind that most Solr operations can be triggered over HTTP, like http://localhost:8080/solr/select/?q=mykeyword&start=0&rows=10&indent=on for searching; creating the index and importing data work the same way.
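For reference, the response of such a select query is XML shaped roughly like this (trimmed, with made-up values):

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="subject">hello solr</str>
    </doc>
  </result>
</response>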

Extension

Data import

Assume we have the blog app already and there are 1000 articles in the database. How do we index them? Emmm... write a small script (or application) to get the articles out of the db and index them? No no no... that is where Solr's DataImportHandler comes in.

Most applications store data in relational databases or XML files and searching over such data is a common use-case. The DataImportHandler is a Solr contrib that provides a configuration driven way to import this data into Solr in both "full builds" and using incremental delta imports.

As we said, open your solrconfig.xml and add one requestHandler tag; it looks something like this (the handler name is what we will use in the import url later):

<requestHandler name="/data_import" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
Here we define a request handler which handles data import from the db. The database source information is kept in db-data-config.xml, at the same level as solrconfig.xml.

Let's check it out. Mine looks roughly like this (swap in your own driver, url and credentials):

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/blog" user="blog" password="blog"/>
  <document>
    <entity name="article" pk="id"
            query="select id, link, subject, content, author, createTime, tags from article"
            deltaQuery="select id from article where updateTime &gt; '${dataimporter.last_index_time}'"
            transformer="com.jacky.search.handler.dataimport.ConvertTagsToWsTextTransformer">
      <field column="id" name="id"/>
      <field column="link" name="link"/>
      <field column="subject" name="subject"/>
      <field column="content" name="content"/>
      <field column="author" name="author"/>
      <field column="createTime" name="createTime"/>
      <field column="tags" name="tags"/>
    </entity>
  </document>
</dataConfig>
The dataSource tag defines the database connection info. On the entity tag, the query attribute tells Solr what to execute to fetch the data when it builds the index for the first time, and deltaQuery what to execute for incremental delta imports. The transformer attribute points to a custom class that massages each row before indexing. The field tags map the database columns to the fields defined in schema.xml.
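About that transformer: DIH accepts any class with a public Object transformRow(Map<String, Object> row) method, so writing one is easy. Only the class name above is real; here is a minimal sketch of what such a class could look like, assuming its job is to turn the comma-separated tags column into whitespace-separated text for the text_ws field:

package com.jacky.search.handler.dataimport;

import java.util.Map;

public class ConvertTagsToWsTextTransformer {
    // DIH calls this once for every row fetched by the entity's query
    public Object transformRow(Map<String, Object> row) {
        Object tags = row.get("tags");
        if (tags != null) {
            // "java,solr,lucene" -> "java solr lucene"; text_ws then splits on whitespace
            row.put("tags", tags.toString().replace(',', ' '));
        }
        return row;
    }
}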


Run Solr

Up to now we have all the config files done, so we start Tomcat with the customized config. Open a browser and point it to http://localhost:8080/solr/admin/stats.jsp; in the CORE section you can see that nothing has been indexed yet, and nothing comes back when you search.

Import data
We are going to import data from the database into Solr and index it.
Since commands can be passed over HTTP, open a browser and point it to http://localhost:8080/solr/data_import?command=full-import&clean=true
Take a look at this url: "data_import" is the name of the requestHandler defined in solrconfig.xml. With this, Solr grabs the data from the db defined in db-data-config.xml and indexes it based on the fields defined in schema.xml. Indexing should be fast. While that process is running you can hit http://localhost:8080/solr/data_import?command=status to check the status of the indexing process, how many documents are done ...etc.

After it is done, go back to http://localhost:8080/solr/admin/stats.jsp; it will show how many docs are indexed and where the index is saved.



to be continued.....

2 comments:

  1. I want to know how we can integrate this functionality with PHP. Once the search button is pressed we need only the results. How can this be achieved?

    Mark

  2. How to configure Carrot with solr and How to render the search results from carrot in a web page
