Sunday, April 12, 2009

Search and Cluster Search Results with Lucene, Solr and Carrot

Background
these days I've been working on the "search part" of a website in PHP. We want to put more effort into this part to make the site stand out from similar ones. So the goal is to have a QUICK, SCALABLE, FAST-TO-GO search engine that gives an EASY-TO-DIG result. (btw, i am not satisfied with the PHP ways of doing search)
QUICK: be quick to fetch search results
SCALABLE: be easy to extend the search functionality (pre-search, search, post-search, even for performance)
FAST-TO-GO: be easy to implement, with a short learning curve
EASY-TO-DIG: from the users' perspective, the search results should be easy to use (like clustering the results)

with these in mind, i prefer a separate application in Java to do it. soon enough it turned out like this:
Lucene as the lower-level index library (tool). version 2.4.1
SOLR as a kind of middleware to maintain the Lucene index (indexing, searching, updating, monitoring, analysis, etc.). version 1.3.0
CARROT as the part for clustering and rendering the results (in this example, i customized the fancy UI that comes with Carrot). version 3.0.1

the rest of this article will be divided into 4 parts:
1. Basics of Lucene: a brief introduction to Lucene
2. Solr: what Solr is, why we use it, how to use it, how to generate the index from existing data in a database, and how to make your own field type (e.g. support for Chinese characters)
3. Carrot: what Carrot is, and how to get it working with different sources (in this case Solr)
4. Conclusions: what you can do differently with these three, and what more can be done.

here we goooooo ...

Lucene
(java)
Since Lucene is famous enough, i will just give a basic introduction.
Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. It provides indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
So you can see it as a lower-level library.
basically there are two things you have to care about when using Lucene: indexing and searching

Index
Like an index in a database: the more data you have, the bigger the difference between querying a field with an index and without one. The same holds for a Lucene index. Indexing converts your data source into documents stored in a format that is optimized for search: Lucene maintains a table of keywords, and each keyword links to the documents that contain it, so searching is fast.
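As a tiny illustration (with made-up documents): after indexing doc1 ("quick brown fox") and doc2 ("quick dog"), the table roughly holds quick -> {doc1, doc2}, brown -> {doc1}, fox -> {doc1}, dog -> {doc2}. Searching for "quick" is then a single lookup in that table instead of a scan over every document.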

Search
Searching means giving some keywords and finding the related documents based on that index.

Core classes
here are some core classes you need to know:
Document - represents a record in the index; it contains a list of key-value fields.
Term - represents a word, Lucene's unit of indexing and searching.
TermDocs - tells which documents contain a given Term.
Directory - represents the location where the index is stored: file system, memory, db, etc.
IndexReader - the core class for reading the index; IndexSearcher, built on top of it, does the searching.
IndexWriter - the core class for building and updating the index.
Hits - represents the search results.

given these classes, we can write a small index-and-search test. the index snippet looks like (Lucene 2.4 style; the old Field.Text factory is replaced by the Field constructor):
Analyzer sAnalyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter("pathToIndexDir", sAnalyzer, true);
Document document = new Document();
Reader ri = new FileReader("pathToFileToBeIndexed");
document.add(new Field("name", "fileName", Field.Store.YES, Field.Index.NOT_ANALYZED));
document.add(new Field("contents", ri));
indexWriter.addDocument(document);
indexWriter.close();
and the search part:
IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory("pathToIndexDir"));
Term term = new Term("contents", "this_is_the_keyword_i_wanna_search_for");
TermQuery q = new TermQuery(term);
Hits hits = searcher.search(q);
that was a short talk about Lucene; if you want to know more, check the resources at the end of this article.

now you have a weapon to fight with. Let's move on.

SOLR
Apache says: Solr is an open source enterprise search server based on the Lucene search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Jetty or Tomcat.

Thanks to its native use of Lucene, we don't have to worry about integrating Lucene into Solr ourselves; Solr handles all the communication with Lucene.

What we need Solr for
you can think of Solr either as a search server or, from a developer's view, as a framework. With Solr you can create the index, update it, delete from it, search it with sophisticated queries, render the results in XML or JSON, do analysis, and even import and extract different data sources into the index (generating the index from a database or from web pages, for instance), etc.

i think these features will attract you, absolutely! is it complicated to set up? No, almost everything can be done through configuration. cool.

Install and run
i am not gonna describe it step by step like other articles do; for the detailed steps you can check the Solr wiki. here i just sum up my experience and share it.

in this article we use Solr 1.3.0. Grab Solr from its homepage and unzip it to "~/apache-solr-1.3.0". Find the "dist" folder and copy "apache-solr-1.3.0.war" from it into your servlet container, in this case Tomcat 6: copy it to the tomcat_home/webapps folder and rename it to solr.war. Since it is a war file, the application will be deployed when we start Tomcat and will be accessible at the url localhost:8080/solr. With the default Tomcat configuration, the war file is unzipped into a folder named after the war file. This application is the Solr search server; in most cases we don't need to change it, all we need to do is start/stop Tomcat :). go to your tomcat_home and run "./bin/startup.sh"; after Tomcat starts up, point your browser to http://localhost:8080/solr/admin/ and it should show up as follows:

this is the admin tool for Solr; we will see more of it later. actually it is pretty self-explanatory.

before moving ahead, please check your tomcat_home: do you find a newly created folder called "solr"? (assuming you ran the Tomcat script from tomcat_home). this folder is called the Solr home; it is where Solr keeps the index files and all the other configuration files.

Configure Solr
To configure Solr, there is one home and two files to care about: the Solr home, schema.xml and solrconfig.xml

solr home

we have installed the server (application), but how do we customize it? after all, everyone wants Solr to behave differently.

The point is that Solr needs a Solr home to run; the Solr home folder contains all the customized files. basically you should copy the "solr" folder from "apache-solr-1.3.0/example/" to wherever you prefer to work, to use as the Solr home (e.g. ~/solr_home). this is the standard, clean Solr home you can reuse for your project.

Point to solr home
how do you let your Solr app know where the Solr home is? there are 3 ways to go:
1. Set the java system property solr.solr.home to your solr home.
2. Configure the servlet container such that a JNDI lookup of "java:comp/env/solr/home" by the solr webapp will point to the solr home
3. The default solr home is "solr" under the JVM's current working directory ($CWD/solr), so start the servlet container in the directory containing ./solr

the third is the reason why we got a solr folder under tomcat_home as mentioned above. in our case we use the second one: edit your context file in Tomcat (e.g. tomcat_home/conf/Catalina/localhost/solr.xml) as sketched below.
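a minimal sketch of such a context file, assuming the war lives under tomcat_home/webapps and the Solr home is ~/solr_home (adjust both paths to your setup):
<Context docBase="/path/to/tomcat_home/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/home/jacky/solr_home" override="true"/>
</Context>
the Environment entry is what the JNDI lookup of "java:comp/env/solr/home" resolves to.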
now stop tomcat , del solr folder from tomcat_home and restart tomcat, you will find solr uses ~/solr_home as home. open http://localhost:8080/solr/admin/ it will tell you the home at the top of the page.

now you have your own solr_home; explore it and you will find the following structure:
bin -- contains a lot of command line scripts
conf -- this is the one we need; it contains the configuration files: schema.xml and solrconfig.xml

schema.xml
just like in a database, each row of a table contains many columns, and each column has its own type: varchar, int, blob, etc. from the Lucene part we know each document in the index contains many fields, and each field has a type too. how many field types Solr supports, how many fields (columns) a document (table) has, which field is of which type, etc.: all of this is defined in schema.xml, just like a database schema.

this xml has very nice comments; after reading them you get an idea of what each tag indicates. the most important ones:
<schema> -- the root tag.
<types> / <fieldType> -- the first part of this file declares all the field types Solr supports.

take the text_ws type, for example; here "ws" stands for white-space: a text field that only splits on whitespace, for exact matching of words.
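its declaration in the example schema.xml that ships with Solr looks roughly like this:
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>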
<fields> / <field> -- the second part of this file declares the fields (columns) of a document (table); in other words, which pieces of information you want Solr/Lucene to keep, and how you want Solr/Lucene to handle them.

<uniqueKey> -- the field used to determine and enforce document uniqueness, like a primary key in a database.

ok, let's take a concrete example: we have a blog application and we want Solr to work as its search engine (or search application). we will use this example throughout this article.

first we need to let Solr know the schema, in other words let Solr understand the data structure. Solr comes with a lot of field types, so we don't need to make a new one (later i will show you how to define your own type, e.g. to deal with Chinese characters); we only declare the fields (no need to make your own data type, just declare the columns of your table), as sketched below.
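a sketch of the <fields> section for this blog example; the field names come from this article, while the types (string, text, date, text_ws) are assumptions picked from the stock schema:
<fields>
  <field name="id"         type="string"  indexed="true"  stored="true" required="true"/>
  <field name="link"       type="string"  indexed="false" stored="true"/>
  <field name="subject"    type="text"    indexed="true"  stored="true"/>
  <field name="content"    type="text"    indexed="true"  stored="true"/>
  <field name="author"     type="string"  indexed="true"  stored="true"/>
  <field name="createTime" type="date"    indexed="true"  stored="true"/>
  <field name="tags"       type="text_ws" indexed="true"  stored="true" multiValued="true"/>
</fields>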
the field attributes mean the following:
name -- mandatory; the name of the field, which we will use later in another config file
type -- mandatory; the name of a previously defined type from the types section
indexed -- true if this field should be indexed (searchable or sortable)
stored -- true if this field should be retrievable
multiValued -- true if this field may contain multiple values per document

so the above says Solr/Lucene will keep the information of id, link, subject, content, author, createTime and tags. some of them need to be indexed, some need to be stored; otherwise you won't get that information back in the search result. notice that tags has multiValued="true": one document may carry more than one tag. with these defined we can say: search subject for the word "jacky", or search content for the word "jacky". Solr also provides an easy way to include the information of several fields in one field; all we need is one special field, sketched below:
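a sketch of that catch-all field plus the copyField rules that feed it (again, the field type is an assumption):
<field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="subject" dest="all_text"/>
<copyField source="content" dest="all_text"/>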
with the new field "all_text", we can search subject and content for the word "jacky" in one go, since Solr/Lucene puts both pieces of information together while indexing. at the end of schema.xml:
<uniqueKey>id</uniqueKey>
<defaultSearchField>all_text</defaultSearchField>
we say the primary key field is "id", and when Solr searches it uses "all_text" as the target field if none is specified. up to now we have told Solr how to deal with the information; next we tell Solr how to run.

solrconfig.xml
this config file is about the Solr runtime itself, not the index structure. like schema.xml, this file has nice comments too. instead of copy-pasting it here, i just point out the parts we need in this case:
<dataDir>${solr.data.dir:/home/jacky/solr_home/data}</dataDir>
this tells Solr where to store the index files; it can be outside the Solr home. one more part we will use is the request handler for importing data, which we talk about soon in the extension part.

now we have all the config done; we could actually start Tomcat and give it a search, but the problem is we haven't indexed any articles yet. the next section talks about that (data import).

before continuing, keep in mind that most Solr operations can be triggered over HTTP, like http://localhost:8080/solr/select/?q=mykeyword&start=0&rows=10&indent=on for searching; creating the index and importing data work the same way.

Extension

Data import
assume we already have the blog app and there are 1000 articles in the database. How do we index them? emmm.... write a small script (or application) to pull the articles out of the db and index them? no no no.... that is where the Solr DataImportHandler comes in.

Most applications store data in relational databases or XML files and searching over such data is a common use-case. The DataImportHandler is a Solr contrib that provides a configuration driven way to import this data into Solr in both "full builds" and using incremental delta imports.

as we said, open your solrconfig.xml and add one requestHandler tag (the handler name is what we will use in the import URL later):

<requestHandler name="/data_import" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>
here we define a request handler which handles data import from the db. the database source information is kept in db-data-config.xml, which lives at the same level as solrconfig.xml.

let's check it out.

<dataConfig>
  <!-- connection details, table and column names below are illustrative; adjust them to your own database.
       the delta queries are sketches too; check the DataImportHandler wiki for the delta-import attributes your Solr version supports -->
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/blog" user="dbuser" password="dbpass"/>
  <document>
    <entity name="article" pk="id"
            query="SELECT id, link, subject, content, author, createTime, tags FROM article"
            deltaQuery="SELECT id FROM article WHERE updateTime > '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT id, link, subject, content, author, createTime, tags FROM article WHERE id = '${dataimporter.delta.id}'"
            transformer="com.jacky.search.handler.dataimport.ConvertTagsToWsTextTransformer">
      <field column="id" name="id"/>
      <field column="link" name="link"/>
      <field column="subject" name="subject"/>
      <field column="content" name="content"/>
      <field column="author" name="author"/>
      <field column="createTime" name="createTime"/>
      <field column="tags" name="tags"/>
    </entity>
  </document>
</dataConfig>
the dataSource tag defines the database connection info. the query attribute on the entity tells Solr what to execute to pull the data out of the db when it builds the index for the first time, and the delta queries tell it what to run for incremental delta imports. the field tags map database columns to the fields defined in Solr's schema.xml. the transformer attribute plugs in our own transformer class (ConvertTagsToWsTextTransformer) to massage the row data before it is indexed.


Run Solr

up to now we have all the config files done, so we start Tomcat with the customized config. open a browser and point it to http://localhost:8080/solr/admin/stats.jsp; in the CORE section you can see that nothing has been indexed yet, and nothing comes back when you search.

Import data
we are gonna import data from the database into Solr and index it.
since commands can be passed over HTTP, open a browser and point it to http://localhost:8080/solr/data_import?command=full-import&clean=true
take a look at this url: "data_import" is the name of the requestHandler we defined in solrconfig.xml. with this, Solr will grab the data from the db defined in db-data-config.xml and index it based on the fields defined in schema.xml. indexing should be fast. while that process is going on, you can hit http://localhost:8080/solr/data_import?command=status to check the indexing status, how much is done, etc.
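later on, for incremental updates, the same handler can be triggered with the delta command, e.g. http://localhost:8080/solr/data_import?command=delta-import, which runs the delta queries defined in db-data-config.xml instead of a full rebuild.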

after it is done, go back to http://localhost:8080/solr/admin/stats.jsp; it will show how many docs are indexed and where the index is saved.



to be continued..... all xml snippets are gone by accident :(

Monday, April 6, 2009

tomcat connector encoding

URIEncoding

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

useBodyEncodingForURI

This specifies if the encoding specified in contentType should be used for URI query parameters, instead of using the URIEncoding. This setting is present for compatibility with Tomcat 4.1.x, where the encoding specified in the contentType, or explicitly set using the Request.setCharacterEncoding method, was also used for the parameters from the URL. The default value is false.


be careful when using useBodyEncodingForURI: it overrides URIEncoding for the query string.
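a minimal Connector sketch for tomcat_home/conf/server.xml, assuming the default HTTP connector on port 8080, with URI decoding forced to UTF-8:
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000" redirectPort="8443"
           URIEncoding="UTF-8" useBodyEncodingForURI="false"/>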

Sunday, April 5, 2009

Light software stack for legacy : JOSH

quick blog

the Grey Lensman proposed a software stack: JOSH (Json, OSGi, Scala, Http)

cited from his article:

Json delivers on what XML promised. Simple to understand, effective data markup accessible and usable by human and computer alike. Serialization/Deserialization is on par with or faster then XML, Thrift and Protocol Buffers. Sure I'm losing XSD Schema type checking, SOAP and WS-* standardization. I'm taking that trade.

OSGi a standardized dynamic, modular framework for versioned components and services. Pick a logger component, a HTTP server component, a ??? component, add your own internal components and you have a dedicated application solution. Micro deployment with true replacement. What am I giving up? The monolithic J2EE application servlet loaded with 25 frameworks, SCA and XML configuration hell. Taking the trade.

HTTP is simple, effective, fast enough, and widely supported. I'm tired of needlessly complex and endless proprietary protocols to move simple data from A to B with all the accompanying firewall port insanity. Yes, HTTP is not perfect. But I'm taking this trade where I can as well.

All interfaces will be simple REST inspired APIs based on HTTP+JSON. This is an immediate consequence of the JOSH stack.

Scala is by far the toughest, yet the easiest selection in the JOSH stack. I wrestled far more with the JSON or XML or Thrift or Protocol Buffers decision.



that's cool; it could be used for solving legacy problems.

JOSH, or JOSCH, where the extra C could stand for something like CouchDB, which could be used to integrate a persistence layer.


read the original "stack proposal": http://thegreylensmansview.blogspot.com/2009/02/book-of-josh.html

Tuesday, March 3, 2009

Symfony VS Struts in Java

Alvaro has a very nice article about symfony performance and features. I commented on it, and since the comment is a bit long, i leave a copy here.

This article is about symfony in PHP and Struts in Java. (they are compared in the same order Alvaro used in his article.)

compared with Struts (the MVC framework in Java), i like symfony more, because it's easier to use and faster to implement with. Maybe we should give thanks to the agile nature of PHP.

But both of them share some of the same features, more or less:

the Factories: In Struts 1, it depended on the well-formed nature of Java: you could extend or implement classes from the framework core classes. Struts 2 uses the famous Spring lib (optionally; XWork has its own as well). The concept of IoC (inversion of control) gives developers a better chance to extend and replace the logic of the original framework without changing any code. actually IoC is everywhere in Struts 2, from populating request parameters into actions to getting a DAO object.
If symfony could introduce IoC, i think it would be more attractive.

Filter Chain: 5 years ago (Struts 1, for instance), we just used filters in Java; a filter there is a real filter which basically pre-executes some logic before the main one. Struts 2 changed the name to interceptors (the interceptor chain), which do the same as the "filter chain" in symfony.
But what could be improved: symfony should have an interface-like class (maybe an abstract class) with two methods, preExecute() and postExecute(); then any subclass's logic would be much clearer. i've seen bugs caused by post-execute logic being put before $filterChain->execute(); by mistake.

Configuration Cascade: very nice feature. Struts should learn from this (to improve, at least).
Grails (something like symfony or Rails, but in Groovy and running on the JVM) has this now. btw, Rails 1.2's performance is much better than before.

Plugins: as a developer, you can easily make your own plugin for both frameworks, and the plugin repositories have a lot of nice off-the-shelf plugins.
since symfony's autoload mechanism is not as robust as Java's, it loses a bit on the performance side.

Controller adaptability: as said before, with IoC you can easily extend it to meet your needs.

View: same as the Controller. in Java you can also choose which technology you would like to render the page with: FreeMarker, Velocity, or the traditional JSP page. very flexible, right?


why i like symfony more:
1. it has command-line utilities which speed up development and save you from the boring configuration and mechanical steps.
(grails has the same functionality, so i like grails as well :))

2. symfony cooperates very well with HTML tags and Ajax. i have a bad impression of the Ajax support in Struts and not a good experience with its tags; symfony does this really well. especially for Ajax: the built-in support is for Prototype, and you can easily switch to jQuery using a plugin.

there is a lot more to say, like the model layer, performance tuning, memcache......

hope we can talk more shortly.

Wednesday, February 4, 2009

301 in header doesn't work, return 200 instead

in one open source project, almost at the end of the script, it says
header("Location: $url", 0, 301);exit;
basically, it would work as expected.

but in fact the page doesn't redirect to the new page. looking deeper, the response returned 200 instead of 301.

(hacking...searching.....)

somewhere before the line above, there is
header('Status: 200 OK');
so this earlier call blocks the later 301. but why doesn't the 301 overwrite the 200? the docs say header() only replaces a previous similar header. so after replacing header("Location: $url", 0, 301); with
header("Location: $url");
header('Status: 301 Moved Permanently');
it works fine.

///////////////////////////////

in some cases, we need to determine whether PHP runs as CGI or not:
else if (SAPI_NAME == 'cgi' OR SAPI_NAME == 'cgi-fcgi')
{
    header("Location: $url");
    // Call the status header after Location so we are sure to wipe out the 302 header sent by PHP
    header('Status: 301 Moved Permanently');
}
else
{
    header("Location: $url");
    header('HTTP/1.1 301 Moved Permanently');
}

CGI gives greater security because PHP runs under a specific user's privileges, but it is slower (an issue on shared servers); the module runs faster but with the same permissions as the web server.

Monday, February 2, 2009

symfony 1.2 propel-build-model Bus error

running 'symfony propel-build-model' ends with 'Bus error'

if it doesn't complain about your libs, go and check your schema.yml file.

i found this is caused by one field in the db having a default value:

i have a column called 'created_at' of type timestamp with the default value 'CURRENT_TIMESTAMP'. this is at the mysql level. fine so far!

run 'propel-build-schema'

you will get
created_at: { type: TIMESTAMP, required: true, defaultValue: CURRENT_TIMESTAMP }
then run 'propel-build-model' and you get the 'Bus error'.

the propel generator might not understand 'CURRENT_TIMESTAMP'. after removing that default from schema.yml (so the line reads created_at: { type: TIMESTAMP, required: true }), you can run the script smoothly.

good luck. every time you hit a problem, just look back; the 'back' at that moment is what we call 'experience'.