Friday, August 23, 2013

Full-Text Search - Part 3

In Full Text Search - Part 2, we discussed how we used bare-bones objects for the user management search. Unfortunately, our reporting system required a much more complex solution. Our administrators were becoming increasingly impatient with NexPort Campus' slow reporting interface. This was further compounded by the limited number of reportable data fields they were given. In an attempt to alleviate these concerns, we spiked out a solution using Microsoft Reporting Services as the backbone running on a separate server. After discovering the limitations of that system, we moved to using SQL views and replication. When replication failed again and again, we revisited Apache Solr for our reporting solution.

We began designing our Solr implementation by identifying the reportable properties we needed to support in our final object graph. The object graph included multiple levels of nesting. The most specific training record entity assignment status contained the section enrollment information, which in turn contained the subscription information, which in turn contained the user information. We wanted to be able to report on each level of the training tree. Because of the inherent flat document structure of Apache Lucene, it did not understand the complex nesting of our object graph. Our first idea was to flatten it all out.

public class User
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string FirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string LastName { get; set; }
}

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid AssignmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionEnrollmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int SectionEnrollmentScore { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; } 
}

This was an incredible amount of duplication, repetition and fragmentation. To add a reportable property for a user required a change to the subscription object, the section enrollment object and the assignment status object. The increased maintenance overhead and probability for making a typo was a potential deterrent to adding new reportable data to the system.

So, to keep our code DRY (Don't Repeat Yourself), we decided to mirror the nesting of our object graph by using objects and attribute mapping to generate the schema.xml for Solr. We populated the data by calling SQL stored procedures using NHibernate mappings. Because we used the same objects for populating as we did for indexing, we had to keep the associated entity IDs on the objects.

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 public virtual Guid UserId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "user")]
 public virtual User User { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid SectionId { get; set; } // Required for populate stored procedure

 public virtual Guid SubscriptionId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "subscription")]
 public virtual Subscription Subscription { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid AssignmentId { get; set; } // Required for populate stored procedure

 public virtual Guid EnrollmentId{ get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "enrollment")]
 public virtual SectionEnrollment Enrollment { get; set; }
}

This resulted in less code and achieved the same effect by adding "." separators to the schema.xml fields. For example, we used "enrollment.subscription.user.lastname" to signify the user's last name from the assignment status report. Because of this break from the JSON structure, we had to write our own parser for the results that Solr returned. We achieved this by tweaking the JSON parser we already had in place to accommodate "." separators rather than curly braces.

With our object graph finalized and the Solr implementation in place, we began to address the nested update locking issue we had discussed in Full-Text Search - Part 1. We solved this problem in the new system by adding SQL triggers and an update queue. When an entity was inserted, updated or deleted, the trigger inserted an entry into its queue table. Each entity had a separate worker process that processed its table queue and queued up related entities into entity-specific queue tables. This took the work out of the user's HTTP request and put it into a background process that could take all the time it required.

To lessen the user impact even more, the trigger just performed a straight insert into the queue table without checking if an entry already existed for that entity. This had a positive impact for the user but meant that Solr would be hammered with duplicate data. To avoid the unnecessary calls to Solr, we used a distinct clause in our SQL query that returned the top X number of distinct entities and recorded the time stamp of when it occurred. After sending the commands to Solr to update or delete the entity, it then deleted any entries in the queue table with the same entity ID that were inserted before the time stamp.

Solr full-text indexing, coupled with a robust change tracking queue and an easily-implemented attribute mapping system provided us with a solid reporting backend that could be used for all our reporting requirements. We still had to add an interface to use it, but most of the heavy lifting was done. Full-text search was implemented successfully!

About NexPort Solutions Group
NexPort Solutions Group is a division of Darwin Global, LLC, a systems and software engineering company that provides innovative, cost-effective training solutions and support for federal, state and local government, as well as the private sector.

0 comments :

Post a Comment

 
Copyright © . NexPort Solutions Engineering Blog - Posts · Comments
Theme Template by BTDesigner · Powered by Blogger