
Friday, August 23, 2013

Full-Text Search - Part 3

In Full-Text Search - Part 2, we discussed how we used bare-bones objects for the user management search. Unfortunately, our reporting system required a much more complex solution. Our administrators were becoming increasingly impatient with NexPort Campus' slow reporting interface. This was further compounded by the limited number of reportable data fields they were given. In an attempt to alleviate these concerns, we spiked out a solution using Microsoft Reporting Services as the backbone running on a separate server. After discovering the limitations of that system, we moved to using SQL views and replication. When replication failed again and again, we revisited Apache Solr for our reporting solution.

We began designing our Solr implementation by identifying the reportable properties we needed to support in our final object graph. The object graph included multiple levels of nesting: the most specific training record entity, the assignment status, contained the section enrollment information, which in turn contained the subscription information, which in turn contained the user information. We wanted to be able to report on each level of the training tree. Because Apache Lucene stores inherently flat documents, it could not represent the complex nesting of our object graph directly. Our first idea was to flatten it all out.

public class User
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string FirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string LastName { get; set; }
}

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid AssignmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionEnrollmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int SectionEnrollmentScore { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; } 
}

This approach produced an incredible amount of duplication, repetition and fragmentation. Adding a reportable property for a user required changes to the subscription object, the section enrollment object and the assignment status object. The increased maintenance overhead and the greater chance of making a typo were a real deterrent to adding new reportable data to the system.

So, to keep our code DRY (Don't Repeat Yourself), we decided to mirror the nesting of our object graph by using objects and attribute mapping to generate the schema.xml for Solr. We populated the data by calling SQL stored procedures using NHibernate mappings. Because we used the same objects for populating as we did for indexing, we had to keep the associated entity IDs on the objects.

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 public virtual Guid UserId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "user")]
 public virtual User User { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid SectionId { get; set; } // Required for populate stored procedure

 public virtual Guid SubscriptionId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "subscription")]
 public virtual Subscription Subscription { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid AssignmentId { get; set; } // Required for populate stored procedure

 public virtual Guid EnrollmentId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "enrollment")]
 public virtual SectionEnrollment Enrollment { get; set; }
}

This resulted in less code and achieved the same effect by joining the schema.xml field names with "." separators. For example, we used "enrollment.subscription.user.lastname" to signify the user's last name on an assignment status report. Because this flattened naming broke from the nested JSON structure, we had to write our own parser for the results that Solr returned. We did so by tweaking the JSON parser we already had in place to accommodate "." separators rather than curly braces.
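As an illustration of the naming scheme, here is a rough sketch of how a prefix-driven field-name walker could produce those dotted names. The SolrSchemaFieldWalker class below is a reconstruction for this post, not our production generator; it only assumes a SolrFieldAttribute with the Prefix property used in the mappings above.

using System;
using System.Collections.Generic;
using System.Linq;

// Reconstruction for illustration: walks a mapped class and emits dotted field names.
public static class SolrSchemaFieldWalker
{
    public static IEnumerable<string> GetFieldNames(Type type, string prefix = "")
    {
        foreach (var property in type.GetProperties())
        {
            var attr = property.GetCustomAttributes(typeof(SolrFieldAttribute), true)
                               .Cast<SolrFieldAttribute>()
                               .FirstOrDefault();
            if (attr == null)
                continue; // Unmapped properties (e.g. IDs kept only for the stored procedures) are skipped.

            if (!string.IsNullOrEmpty(attr.Prefix))
            {
                // Nested object: recurse, joining names with "." separators.
                var nestedPrefix = prefix + attr.Prefix + ".";
                foreach (var name in GetFieldNames(property.PropertyType, nestedPrefix))
                    yield return name;
            }
            else
            {
                // Leaf field, emitted into schema.xml.
                yield return prefix + property.Name.ToLowerInvariant();
            }
        }
    }
}

Walking AssignmentStatus this way yields names such as "enrollment.subscription.user.lastname", which is exactly the shape of field the generated schema.xml contained.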

With our object graph finalized and the Solr implementation in place, we began to address the nested update locking issue we had discussed in Full-Text Search - Part 1. We solved this problem in the new system by adding SQL triggers and an update queue. When an entity was inserted, updated or deleted, the trigger inserted an entry into its queue table. Each entity had a separate worker process that processed its queue table and queued up related entities into their own entity-specific queue tables. This took the work out of the user's HTTP request and put it into a background process that could take all the time it required.

To lessen the user impact even more, the trigger just performed a straight insert into the queue table without checking if an entry already existed for that entity. This had a positive impact for the user but meant that Solr would be hammered with duplicate data. To avoid the unnecessary calls to Solr, we used a distinct clause in our SQL query that returned the top X number of distinct entities and recorded the time stamp of when it occurred. After sending the commands to Solr to update or delete the entity, it then deleted any entries in the queue table with the same entity ID that were inserted before the time stamp.
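To make the queue mechanics concrete, here is a minimal sketch of one entity's queue-draining pass. The table, column and helper names (EntityQueue, QueuedOn, ISolrIndexer) are placeholders for this post; the real implementation used per-entity queue tables and stored procedures.

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

// Placeholder for whatever pushes the update or delete commands to Solr.
public interface ISolrIndexer
{
    void Reindex(Guid entityId);
}

public class QueueDrainingWorker
{
    // Drains one batch from a single entity's queue table.
    public void DrainBatch(SqlConnection connection, ISolrIndexer indexer, int batchSize)
    {
        // Rows queued before this moment are safe to purge once the entity is re-indexed.
        var batchStart = DateTime.UtcNow;

        var ids = new List<Guid>();
        using (var select = new SqlCommand(
            "SELECT DISTINCT TOP (@top) EntityId FROM EntityQueue WHERE QueuedOn < @start",
            connection))
        {
            select.Parameters.AddWithValue("@top", batchSize);
            select.Parameters.AddWithValue("@start", batchStart);
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                    ids.Add(reader.GetGuid(0));
            }
        }

        foreach (var id in ids)
        {
            // Update or delete the Solr document for this entity.
            indexer.Reindex(id);

            // Remove every duplicate row for this entity inserted before the pass began;
            // anything queued later stays put, so no change is lost.
            using (var delete = new SqlCommand(
                "DELETE FROM EntityQueue WHERE EntityId = @id AND QueuedOn < @start",
                connection))
            {
                delete.Parameters.AddWithValue("@id", id);
                delete.Parameters.AddWithValue("@start", batchStart);
                delete.ExecuteNonQuery();
            }
        }
    }
}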

Solr full-text indexing, coupled with a robust change-tracking queue and an easily implemented attribute mapping system, provided us with a solid reporting backend that could be used for all our reporting requirements. We still had to add an interface on top of it, but most of the heavy lifting was done. Full-text search was implemented successfully!

About NexPort Solutions Group
NexPort Solutions Group is a division of Darwin Global, LLC, a systems and software engineering company that provides innovative, cost-effective training solutions and support for federal, state and local government, as well as the private sector.

Thursday, July 25, 2013

Full-Text Search - Part 2

In my previous post Full-Text Search - Part 1, I discussed our reporting solution using NHibernate Search and MSMQ. NHibernate Search had been dead in the water for months when our administrators began complaining that user search results were extremely slow. We were using NHibernate to do a joined query on two tables with a "like" condition on four of the columns. Needless to say, it was not the most efficient operation.

Our development team decided to revisit full-text search and scrap NHibernate Search. Instead, we wanted to develop a service that we could communicate with over HTTP that would update Lucene documents. We were researching how this could be done when we found Apache Solr. It did exactly what we had planned for our proposed service and had the added benefit of a strong user base, so we scrapped the idea of creating our own service and went with the proven solution instead.

Because of our unfamiliarity with the product, we postponed automating the Solr installation until a later release and installed it manually. Solr is written in Java and needs a Java servlet container in which to run. Apache Tomcat was the recommended container, so we installed it first by following the installation prompts. Installing Solr itself simply required copying the solr.war file from the download package into the lib folder of the Tomcat directory. All of the other entity-specific work was more easily automated.

The remaining automation tasks were generating the Extensible Markup Language (XML) files for the entities we were indexing and copying over configuration files. The Tomcat context XML files were required to tell Tomcat where the indexes lived on disk. The schema XML file served as a contract so that Solr could correctly construct the Lucene documents and so that we could communicate changes to those documents.

To avoid processing our complex object graph, we decided not to tie the index processing to our existing entities. Instead, we created a separate entity that only had the four fields we cared about: First Name, Last Name, Login Name and Primary Email. We wrote our own SolrFieldAttribute class that included all the flags needed to create the Solr schema XML file for the entity. We also created a SolrClassAttribute to indicate the classes for which to create the schema XML file and to create the Tomcat context file.
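For readers curious what those attributes might look like, here is a minimal sketch. The flag names (Stored, Indexed, IsKey, LowercaseCopy, TokenizedCopy, MultiValued, Prefix) are taken from the mappings and descriptions shown across these posts; everything else, including defaults, is an assumption rather than the actual NexPort source.

using System;

// Minimal sketch of the mapping attributes described above; an illustrative
// reconstruction, not the production code.
[AttributeUsage(AttributeTargets.Class)]
public class SolrClassAttribute : Attribute
{
    // Marks a class that should get a generated schema.xml and Tomcat context file.
}

[AttributeUsage(AttributeTargets.Property)]
public class SolrFieldAttribute : Attribute
{
    public bool Stored { get; set; }        // emit stored="true" on the schema field
    public bool Indexed { get; set; }       // emit indexed="true" on the schema field
    public bool IsKey { get; set; }         // emit the field as the <uniqueKey>
    public bool LowercaseCopy { get; set; } // add a *.lowercase field plus a copyField
    public bool TokenizedCopy { get; set; } // add a *.tokenized field plus a copyField
    public bool MultiValued { get; set; }   // emit multiValued="true" (flag name assumed)
    public string Prefix { get; set; }      // flatten a nested object into "prefix.field" names
}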

Initially, we only stored the four fields that were going to be displayed in the result list and that we were going to use to filter the results. This allowed us to process the documents for these lightweight entities much more quickly. Unfortunately, our system also had to restrict the users an administrator could see based on the administrator's permission set. To address this problem, we used Solr's multi-valued property to index all of the organizations in which the user had a subscription. From the code's standpoint, we added another property to the SolrFieldAttribute that indicated to our schema generator that it needed to add multiValued="true" to the schema field.

To provide the best user results list, we also had to add a field to the entity called Full Name that was the First Name and Last Name separated by a space. This would allow a user to enter "Sally Sunshine," and she would be the top user in the results list.
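Putting those pieces together, the lightweight search entity probably looked something along these lines. This is a reconstruction based on the fields described above and the schema shown below; in particular, the MultiValued flag name is an assumption, since the post only says a new property was added to the SolrFieldAttribute.

using System;
using System.Collections.Generic;

// Reconstruction of the lightweight user search entity (not the actual NexPort class).
[SolrClass]
public class UserSearchDocument
{
    [SolrField(Stored = true, Indexed = true, IsKey = true)]
    public virtual Guid Id { get; set; }

    [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
    public virtual string FirstName { get; set; }

    [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
    public virtual string LastName { get; set; }

    // First and last name separated by a space, so "Sally Sunshine" matches well.
    [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
    public virtual string FullName { get; set; }

    [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
    public virtual string LoginName { get; set; }

    [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
    public virtual string PrimaryEmail { get; set; }

    // One entry per organization the user has a subscription in, used to restrict
    // results to what the searching administrator is allowed to see.
    [SolrField(Stored = false, Indexed = true, MultiValued = true)]
    public virtual ICollection<Guid> OrganizationIds { get; set; }
}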

To make the update calls to Solr, we used a background job scheduling service that we already had in place. When an object changed that needed to be indexed, we created an entry in a table with the object's id and date updated. The scheduling service job, called the indexing job, then pulled rows from the table, converted them to Solr JSON objects and sent HTTP calls to Solr with the JSON object. The first problem we encountered was that the JSON used by Solr was not standard JSON. We had to write our own converter instead of using a standard converter like Newtonsoft. Notice in the examples below that Solr JSON does not handle arrays in the same way as standard JSON.

Example of standard JSON:

{
     "id":"a017a879-a2f2-4e37-811d-8f13ace56819",
     "firstname":"Sally",
     "lastname":"Sunshine",
     "fullname":"Sally Sunshine",
     "primaryemail":"ssunshine@sunshine.net",
     "loginname":"ssunshine",
     "organizationid":
     [
          "c04eb269-3a81-442f-85b5-23b4bd87d262",
          "87466c69-1b8f-4895-9483-5c1cb55ecb2b"
     ]
}

Example of Solr JSON:

{
     "id":"a017a879-a2f2-4e37-811d-8f13ace56819",
     "firstname":"Sally",
     "lastname":"Sunshine",
     "fullname":"Sally Sunshine",
     "primaryemail":"ssunshine@sunshine.net",
     "loginname":"ssunshine",
     "organizationid":"c04eb269-3a81-442f-85b5-23b4bd87d262",
     "organizationid":"87466c69-1b8f-4895-9483-5c1cb55ecb2b"
}
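Because the multi-valued field is expressed as the same key repeated once per value, a standard serializer's object model does not fit. Here is a minimal sketch of the kind of hand-rolled writer this requires; it is illustrative only (our actual converter was adapted from the JSON code we already had), and string escaping is omitted for brevity.

using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Illustrative Solr-flavored JSON writer, not the actual converter.
public static class SolrJsonWriter
{
    public static string ToSolrJson(IDictionary<string, object> fields)
    {
        var sb = new StringBuilder("{");
        bool first = true;

        foreach (var pair in fields)
        {
            // Multi-valued fields (e.g. organizationid) are emitted as the same key
            // repeated once per value, which is why an off-the-shelf serializer did not fit.
            var values = pair.Value is string || !(pair.Value is IEnumerable)
                ? new[] { pair.Value }
                : ((IEnumerable)pair.Value).Cast<object>();

            foreach (var value in values)
            {
                if (!first) sb.Append(",");
                first = false;
                sb.Append("\"").Append(pair.Key).Append("\":\"").Append(value).Append("\"");
            }
        }

        return sb.Append("}").ToString();
    }
}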

After successfully writing our specialized JSON converter, we ran into problems with filtering the results. The beauty of full-text search is that it can return close results instead of just exact matches, but we were not getting the results we expected. The results returned by Solr were sorted by a score of how well each result matched the query criteria. To get the best possible matches, we had to add weights to each of the filters: exact matches got the most weight, split-term matches got the next highest weight and wildcard matches got the least.
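Those weights map naturally onto Solr query boosts against the copy fields shown in the schema below. The snippet here is purely illustrative; the boost values and exact clause layout are examples, not the production query.

// Illustrative only: builds a boosted user query in the spirit described above.
public static class UserQueryBuilder
{
    public static string Build(string term)
    {
        var lowered = term.Trim().ToLowerInvariant();
        var firstTerm = lowered.Split(' ')[0];

        // Exact phrase match weighs the most, individual terms weigh less,
        // and a wildcard (prefix) match weighs the least.
        return string.Format(
            "fullname.lowercase:\"{0}\"^10 OR fullname.tokenized:({0})^4 OR fullname.lowercase:{1}*",
            lowered,
            firstTerm);
    }
}

For "Sally Sunshine" this produces an exact-phrase clause, a two-term clause and a "sally*" prefix clause, so the closest match bubbles to the top of the results list.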

To avoid problems with case sensitivity, we added a Lowercase property to the SolrFieldAttribute that told Solr to automatically maintain a lowercase version of the indexed field in the document. It did this by declaring a second version of the field in the schema XML file along with a copyField element. We also added a Tokenized property to the SolrFieldAttribute that worked in a similar way but broke the value on spaces rather than lowercasing it. The generated schema fields can be seen below.


<fields>
 <field name="indexdate" indexed="true" type="date" stored="true" multiValued="false" default="NOW"/>
 <field name="id" indexed="true" type="uuid" stored="true" required="true"/>
 <field name="firstname" indexed="true" type="string" stored="true"/>
 <field name="firstname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="firstname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="lastname" indexed="true" type="string" stored="true"/>
 <field name="lastname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="lastname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="fullname" indexed="true" type="string" stored="true"/>
 <field name="fullname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="fullname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="primaryemail" indexed="true" type="string" stored="true"/>
 <field name="primaryemail.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="primaryemail.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="loginname" indexed="true" type="string" stored="true"/>
 <field name="loginname.lowercase" indexed="true" type="lowercase" stored="false"/>
 <field name="loginname.tokenized" indexed="true" type="text_en" stored="false"/>
 <field name="organizationid" indexed="true" type="uuid" stored="false" multiValued="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField dest="firstname.lowercase" source="firstname"/>
<copyField dest="firstname.tokenized" source="firstname"/>
<copyField dest="lastname.lowercase" source="lastname"/>
<copyField dest="lastname.tokenized" source="lastname"/>
<copyField dest="fullname.lowercase" source="fullname"/>
<copyField dest="fullname.tokenized" source="fullname"/>
<copyField dest="primaryemail.lowercase" source="primaryemail"/>
<copyField dest="primaryemail.tokenized" source="primaryemail"/>
<copyField dest="loginname.lowercase" source="loginname"/>
<copyField dest="loginname.tokenized" source="loginname"/>

The initial indexing of entities in the database took some time. We wrote a separate job that went through the Users table and added rows to the update table that the indexing job was monitoring. After the initial indexing, each NHibernate update for a user inserted a row into the update table. Because the indexing job processed hundreds of entities at a time, the user documents closely mirrored the changes to the database.

Our administrators were very happy with the speed of our new Full-Text user management search. After performing a search, the administrator could then click the user's name in the list to be taken to the user's full profile. This increase in efficiency caused administrators to begin to wonder where else this magical solution could be applied.

In Part 3, we'll bring this series full-circle and discuss how Full-Text Search was finally able to make its way into our reporting system. Stay tuned!


Friday, June 28, 2013

Full-Text Search - Part 1

The journey with full-text search has not been an easy one. Two and a half years ago, we began looking for a faster way to do searches across multiple related database tables. Our primary target was to improve the performance of reporting on student data in our Learning Management System, NexPort Campus. We were already using the Castle ActiveRecord library to map our C# classes to our database tables with attributes. ActiveRecord is built upon NHibernate, a popular .NET Object Relational Mapper (ORM). Because we were already used to attribute mapping, it made sense to try to use a similar toolset for full-text search. Enter NHibernate Search.

NHibernate Search was an extension of NHibernate built on Lucene.NET, a .NET port of the Java full-text search engine Apache Lucene. Similar to ActiveRecord, NHibernate Search used attribute mapping to designate what should be included in the documents stored in Lucene.NET's document store. (A document is a collection of text fields stored in a denormalized way to make queries faster.)

At the time of our implementation, Lucene.NET was several versions behind its ancestor. This should have been a red flag, as was the fact that NHibernate Search had not had a new release in some time. Despite these troubling indicators, we plowed on. We started by mapping out all of the properties required to sustain our previous SQL reporting backend. Our model is quite complex, so this was no easy task. Primitive types and simple objects such as String and DateTime used a Field attribute, and user-defined objects used an IndexedEmbedded attribute. In addition to the basic attributes required by NHibernate Search, we also had to write separate IFieldBridge implementations and include the FieldBridge attribute on each property. Needless to say, our class files exploded with non-intuitive code.
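To give a flavor of what that mapping looked like, here is a rough sketch in the style of the NHibernate Search attributes. The attribute names follow the Hibernate Search lineage ([Indexed], [DocumentId], [Field], [IndexedEmbedded], [FieldBridge]); exact signatures varied by version, and DateTimeFieldBridge stands in for one of our custom IFieldBridge implementations.

// Sketch of the mapping style only; exact attribute signatures varied by
// NHibernate Search version, and DateTimeFieldBridge is a hypothetical stand-in.
[Indexed]
public class SectionEnrollment
{
    [DocumentId]
    public virtual Guid Id { get; set; }

    // Primitive and simple types used a Field attribute, usually with a bridge
    // controlling how the value was written into the Lucene document.
    [Field(Index.UnTokenized)]
    [FieldBridge(typeof(DateTimeFieldBridge))]
    public virtual DateTime ExpirationDate { get; set; }

    // User-defined objects used IndexedEmbedded, which pulled the related
    // object's mapped fields into the same document.
    [IndexedEmbedded]
    public virtual Subscription Subscription { get; set; }
}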

NHibernate Search used the attributes in its listeners to determine when an object changed and needed to be re-indexed. If a related object changed, it would trigger the next object to be processed, all in the same session. In our case, one of our indexed objects was a training record object, the section enrollment. If a user object changed, it would trigger re-indexing of the user itself and of all related subscriptions, which in turn triggered all section enrollments to be re-indexed. This led to a very large problem in our production system, which I will detail in a bit.

The whole idea of this undertaking was to decrease load on the database server while making search and reporting results faster. To that end, we put the indexing work on a separate machine. To deliver the documents to be indexed, we used Microsoft Message Queuing (MSMQ) and wrote our own backend queue processor factory for NHibernate Search. When an object changed, NHibernate Search translated it into LuceneWork objects. These LuceneWork objects were then serialized into a packet that MSMQ could handle. If the packet was too large, it was split into multiple packets and reassembled on the other side. MSMQ worked fine when the machines were on the same domain. However, when we moved to our Beta system, cross-domain issues began to crop up. After hours of research and trial and error, we finally solved the problem by tweaking the Global Catalog on the domain controller.
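For context on the packet splitting: MSMQ caps a single message at roughly 4 MB, so an oversized batch of serialized LuceneWork had to be chunked and reassembled. The sketch below shows only the sending side; the queue path, chunk size and label format are illustrative, not our actual protocol.

using System;
using System.IO;
using System.Messaging;

// Illustrative sender that splits a serialized payload into MSMQ-sized chunks.
public static class IndexWorkSender
{
    private const int ChunkSize = 3 * 1024 * 1024; // stay under MSMQ's ~4 MB message limit

    public static void Send(byte[] payload, string queuePath = @".\private$\index_work")
    {
        using (var queue = new MessageQueue(queuePath))
        {
            int totalChunks = (payload.Length + ChunkSize - 1) / ChunkSize;
            var batchId = Guid.NewGuid();

            for (int i = 0; i < totalChunks; i++)
            {
                int offset = i * ChunkSize;
                int length = Math.Min(ChunkSize, payload.Length - offset);
                var chunk = new byte[length];
                Array.Copy(payload, offset, chunk, 0, length);

                // The receiver reassembles chunks by batch id and sequence number.
                var message = new Message
                {
                    Label = string.Format("{0}:{1}:{2}", batchId, i, totalChunks),
                    BodyStream = new MemoryStream(chunk)
                };
                queue.Send(message);
            }
        }
    }
}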

To make reads even faster, we implemented a Master-Slave relationship with our indexes. One master index was for writes, and there could be one or more slave indexes to read from. In our first attempt, we used Microsoft's Distributed File System (DFS) to keep the slaves updated from the master. We quickly ran into file-locking problems, so we went to a code-based synchronization solution. We used the Microsoft.Synchronization namespace to replicate the data, ignoring specific files that were causing locking problems.
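A rough sketch of what such a sync pass can look like with the Sync Framework file provider is below. The API usage is from memory and the excluded file name is illustrative; the point is simply that the lock files were filtered out of the scope rather than copied.

using Microsoft.Synchronization;
using Microsoft.Synchronization.Files;

// Rough sketch of a master-to-slave index copy using the Sync Framework file provider
// (paths and excluded file names are illustrative, not the production configuration).
public static class IndexReplicator
{
    public static void SyncMasterToSlave(string masterPath, string slavePath)
    {
        var filter = new FileSyncScopeFilter();
        filter.FileNameExcludes.Add("write.lock"); // skip the lock file that caused problems

        using (var master = new FileSyncProvider(masterPath, filter, FileSyncOptions.None))
        using (var slave = new FileSyncProvider(slavePath, filter, FileSyncOptions.None))
        {
            var orchestrator = new SyncOrchestrator
            {
                LocalProvider = master,
                RemoteProvider = slave,
                Direction = SyncDirectionOrder.Upload // push master changes out to the slave
            };
            orchestrator.Synchronize();
        }
    }
}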

The file synchronization code was the last piece of the puzzle. After spending months working on the new full-text search reporting backend, it was finally time to release the product. Remember the large problem I mentioned earlier? Well, as soon as users started logging into the system, the extra processing added by NHibernate Search brought the server to its knees. It took minutes for people to do something as simple as log in to the system. We immediately had to turn off the listeners for NHibernate Search and re-release NexPort Campus. It was a complete and utter disaster.

The moral of this story is not that NHibernate Search is the devil. The main problem with this solution was over-engineering. Trying to avoid future work by cobbling together too many third-party components that just did not fit well together was short-sighted and ended up being more work in the end. It made for ugly code and an unmaintainable system.

In the weeks following our disastrous release, another developer and I began to think of ways to offload the extra processing. We had some good ideas for it and were in the process of making the changes when priorities changed. Full-text search sat in an unused, half-finished state for nearly two years. When the idea came up to improve the search capability and performance of our user management system, we revisited the full-text search solution. That's when we discovered the holy grail of full-text search, Apache Solr.

For the story of how Solr saved the day, please stay tuned for my next post, Full-Text Search - Part 2.

 