Thursday, July 25, 2013

Full-Text Search - Part 2

In my previous post Full-Text Search - Part 1, I discussed our reporting solution using NHibernate Search and MSMQ. NHibernate Search had been dead in the water for months when our administrators began complaining that user search results were extremely slow. We were using NHibernate to do a joined query on two tables with a "like" condition on four of the columns. Needless to say, it was not the most efficient operation.

Our development team decided to revisit Full-Text Search and scrap NHibernate Search. Instead, we wanted to develop a service that we could communicate with over HTTP and that would update Lucene documents. We were researching how this could be done when we found Apache Solr. It did exactly what we had planned for our proposed service and had the added benefit of a strong user base. We scrapped the idea of creating our own service and went with the proven solution instead.

Because of our unfamiliarity with the product, we postponed automating the Solr installation until a later release and installed it manually. Solr is written in Java and needs a Java servlet container to run in. Apache Tomcat was the recommended container, so we installed it first by following the installation prompts. Installing Solr itself simply required copying the solr.war file from the download package into the lib folder of the Tomcat directory. All of the other entity-specific work was more easily automated.

The remaining automation tasks were generating the Extensible Markup Language (XML) files for the entities we were indexing and copying over the configuration files. The Tomcat context XML files told Tomcat where each entity's indexes lived on disk. The schema XML file acted as a contract, allowing Solr to construct the Lucene documents correctly and allowing us to communicate changes to those documents.

To avoid processing our complex object graph, we decided not to tie the index processing to our existing entities. Instead, we created a separate entity with only the four fields we cared about: First Name, Last Name, Login Name and Primary Email. We wrote our own SolrFieldAttribute class that included all of the flags needed to generate the Solr schema XML file for the entity, and a SolrClassAttribute to mark the classes that needed a schema XML file and a Tomcat context file.
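The attribute classes themselves might have looked something like the minimal sketch below. Only the two class names come from our actual code; the individual flags and defaults shown here are illustrative assumptions.

using System;

// Marks a class that needs a generated schema XML file and Tomcat context file.
[AttributeUsage(AttributeTargets.Class)]
public class SolrClassAttribute : Attribute
{
    // Hypothetical: the core name used when generating the files.
    public string CoreName { get; set; }
}

// Carries the flags the schema generator needs for a single field.
[AttributeUsage(AttributeTargets.Property)]
public class SolrFieldAttribute : Attribute
{
    public SolrFieldAttribute()
    {
        // Illustrative defaults matching the most common schema entries.
        Indexed = true;
        Stored = true;
        FieldType = "string";
    }

    public bool Indexed { get; set; }
    public bool Stored { get; set; }
    public bool Required { get; set; }
    public string FieldType { get; set; }
}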

Initially, we only stored the four fields that would be displayed in the result list and used to filter the results. This allowed us to process the documents for these lightweight entities much more quickly. Unfortunately, our system also had to restrict which users an administrator could see based on the administrator's permission set. To address this, we used Solr's multi-valued field support to index all of the organizations in which the user had a subscription. From the code's standpoint, we added another property to the SolrFieldAttribute that told our schema generator to add multiValued="true" to the schema field, as sketched below.
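Putting the pieces together, and assuming the MultiValued flag just described has been added to the attribute, the lightweight entity might have looked roughly like this (the class name is an assumption; the field names match the schema shown later):

using System;
using System.Collections.Generic;

[SolrClass(CoreName = "users")]
public class UserSearchEntity
{
    [SolrField(FieldType = "uuid", Required = true)]
    public Guid Id { get; set; }

    [SolrField]
    public string FirstName { get; set; }

    [SolrField]
    public string LastName { get; set; }

    [SolrField]
    public string LoginName { get; set; }

    [SolrField]
    public string PrimaryEmail { get; set; }

    // Organizations in which the user has a subscription, used to filter
    // results by the administrator's permission set; generates multiValued="true".
    [SolrField(FieldType = "uuid", Stored = false, MultiValued = true)]
    public IList<Guid> OrganizationId { get; set; }
}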

To provide the best results list, we also added a Full Name field to the entity: the First Name and Last Name separated by a space. This allowed a user to enter "Sally Sunshine," and she would appear at the top of the results list.

To make the update calls to Solr, we used a background job scheduling service that we already had in place. When an object that needed to be indexed changed, we created an entry in a table with the object's id and date updated. The scheduling service job, called the indexing job, then pulled rows from the table, converted each object to a Solr JSON document and sent it to Solr over HTTP. The first problem we encountered was that the JSON used by Solr was not standard JSON, so we had to write our own converter instead of using a standard library like Newtonsoft's. Notice in the examples below that Solr JSON does not handle arrays the same way standard JSON does.

Example of standard JSON:

{
     "id":"a017a879-a2f2-4e37-811d-8f13ace56819",
     "firstname":"Sally",
     "lastname":"Sunshine",
     "fullname":"Sally Sunshine",
     "primaryemail":"ssunshine@sunshine.net",
     "loginname":"ssunshine",
     "organizationid":
     [
          "c04eb269-3a81-442f-85b5-23b4bd87d262",
          "87466c69-1b8f-4895-9483-5c1cb55ecb2b"
     ]
}

Example of Solr JSON:

{
     "id":"a017a879-a2f2-4e37-811d-8f13ace56819",
     "firstname":"Sally",
     "lastname":"Sunshine",
     "fullname":"Sally Sunshine",
     "primaryemail":"ssunshine@sunshine.net",
     "loginname":"ssunshine",
     "organizationid":"c04eb269-3a81-442f-85b5-23b4bd87d262",
     "organizationid":"87466c69-1b8f-4895-9483-5c1cb55ecb2b"
}
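A minimal sketch of the kind of converter this required is shown below. It is illustrative only: it skips the string escaping and date formatting a real converter would need. The key point is that any enumerable value is written as repeated "name":"value" pairs rather than as a JSON array.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Text;

public static class SolrJsonWriter
{
    public static string Write(IDictionary<string, object> fields)
    {
        var sb = new StringBuilder("{");
        bool first = true;
        foreach (var pair in fields)
        {
            var values = pair.Value as IEnumerable;
            if (values != null && !(pair.Value is string))
            {
                // Multi-valued field: repeat the key for each value.
                foreach (var value in values)
                    AppendField(sb, pair.Key, value, ref first);
            }
            else
            {
                AppendField(sb, pair.Key, pair.Value, ref first);
            }
        }
        return sb.Append("}").ToString();
    }

    private static void AppendField(StringBuilder sb, string name, object value, ref bool first)
    {
        if (!first) sb.Append(",");
        // Sketch only: real code must escape quotes and special characters.
        sb.AppendFormat("\"{0}\":\"{1}\"", name, value);
        first = false;
    }
}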

After successfully writing our specialized JSON converter, we ran into problems filtering the results. The beauty of Full-Text Search was that it could return close matches instead of just exact matches, but we were not getting the results we expected. Solr sorts results by a score of how well each document matches the query criteria. To get the best possible matches at the top, we had to add weights to each of the filters: exact matches got the most weight, split-term matches got the next highest weight and wildcard matches got the least.
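For example, a search for "Sally Sunshine" might expand into a weighted query along these lines (the boost values and exact syntax are illustrative, not our production query):

fullname.lowercase:"sally sunshine"^10
OR fullname.tokenized:(sally sunshine)^5
OR fullname.lowercase:sally\ sunshine*^1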

To avoid problems with case sensitivity, we added a Lowercase property to the SolrFieldAttribute that told the schema generator to index a lowercase version of the field alongside the original. It did this by declaring a second field in the schema XML file along with a copyField element. We also added a Tokenized property to the SolrFieldAttribute that worked in a similar way but split the field on spaces rather than lowercasing it. The generated schema fields can be seen below.


<fields>
 <field name="indexdate" indexed="true" type="date" stored="true" multiValued="false" default="NOW"/>
 <field name="id" indexed="true" type="uuid" stored="true" required="true"/>
 <field name="firstname" indexed="true" type="string" stored="true"/>
 <field name="firstname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="firstname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="lastname" indexed="true" type="string" stored="true"/>
 <field name="lastname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="lastname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="fullname" indexed="true" type="string" stored="true"/>
 <field name="fullname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="fullname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="primaryemail" indexed="true" type="string" stored="true"/>
 <field name="primaryemail.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="primaryemail.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="loginname" indexed="true" type="string" stored="true"/>
 <field name="loginname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="loginname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="organizationid" indexed="true" type="uuid" stored="false" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField dest="firstname.lowercase" source="firstname"/>
<copyField dest="firstname.tokenized" source="firstname"/>
<copyField dest="lastname.lowercase" source="lastname"/>
<copyField dest="lastname.tokenized" source="lastname"/>
<copyField dest="fullname.lowercase" source="fullname"/>
<copyField dest="fullname.tokenized" source="fullname"/>
<copyField dest="primaryemail.lowercase" source="primaryemail"/>
<copyField dest="primaryemail.tokenized" source="primaryemail"/>
<copyField dest="loginname.lowercase" source="loginname"/>
<copyField dest="loginname.tokenized" source="loginname"/>

The initial indexing of the entities already in the database took some time. We wrote a separate job that walked the Users table and added rows to the update table that the indexing job was monitoring. After the initial indexing, each NHibernate update to a user inserted a row into the update table. Because the indexing job processed hundreds of entities at a time, the user documents closely mirrored the state of the database.
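The core of the indexing job might be sketched as follows. The batch size, update URL and helper methods are assumptions; the converter is the one sketched earlier.

using System;
using System.Collections.Generic;
using System.Net;

public class IndexingJob
{
    // Hypothetical Solr core URL; commit=true keeps the sketch simple.
    private const string SolrUpdateUrl =
        "http://localhost:8080/solr/update/json?commit=true";

    public void ProcessPendingUpdates()
    {
        // Pull a batch of pending rows written when indexed objects changed.
        foreach (var id in GetPendingObjectIds(500))
        {
            // Load the lightweight entity's fields and convert them to
            // Solr-style JSON with the converter sketched earlier.
            IDictionary<string, object> fields = LoadUserFields(id);
            string json = SolrJsonWriter.Write(fields);

            using (var client = new WebClient())
            {
                client.Headers[HttpRequestHeader.ContentType] = "application/json";
                client.UploadString(SolrUpdateUrl, json);
            }
        }
    }

    // Hypothetical helpers: read ids from the update table and load the
    // indexed fields (plus organization ids) for a user.
    private IEnumerable<Guid> GetPendingObjectIds(int batchSize) { return new List<Guid>(); }
    private IDictionary<string, object> LoadUserFields(Guid id) { return new Dictionary<string, object>(); }
}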

Our administrators were very happy with the speed of the new Full-Text user management search. After performing a search, an administrator could click a user's name in the list to go straight to that user's full profile. This increase in efficiency led administrators to wonder where else this magical solution could be applied.

In Part 3, we'll bring this series full circle and discuss how Full-Text Search finally made its way into our reporting system. Stay tuned!

About NexPort Solutions Group

NexPort Solutions Group is a division of Darwin Global, LLC, a systems and software engineering company that provides innovative, cost-effective training solutions and support for federal, state and local government, as well as the private sector.
