Friday, August 23, 2013

Full-Text Search - Part 3

In Full-Text Search - Part 2, we discussed how we used bare-bones objects for the user management search. Unfortunately, our reporting system required a much more complex solution. Our administrators were becoming increasingly impatient with NexPort Campus' slow reporting interface, a problem compounded by the limited number of reportable data fields available to them. In an attempt to alleviate these concerns, we spiked out a solution using Microsoft Reporting Services as the backbone, running on a separate server. After discovering the limitations of that system, we moved to using SQL views and replication. When replication failed again and again, we revisited Apache Solr for our reporting solution.

We began designing our Solr implementation by identifying the reportable properties we needed to support in our final object graph. The object graph included multiple levels of nesting: the most specific training record entity, the assignment status, contained the section enrollment information, which in turn contained the subscription information, which in turn contained the user information. We wanted to be able to report on each level of the training tree. Because Apache Lucene documents are inherently flat, Solr could not directly represent the complex nesting of our object graph. Our first idea was to flatten it all out:

public class User
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string FirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string LastName { get; set; }
}

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid AssignmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionEnrollmentId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int SectionEnrollmentScore { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SectionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid SubscriptionId { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual Guid UserId { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserFirstName { get; set; }

 [SolrField(Stored = true, Indexed = true, LowercaseCopy = true, TokenizedCopy = true)]
 public virtual string UserLastName { get; set; } 
}

This meant an incredible amount of duplication and fragmentation. Adding a reportable property for a user required a change to the subscription object, the section enrollment object and the assignment status object. The increased maintenance overhead and the likelihood of typos were a real deterrent to adding new reportable data to the system.

So, to keep our code DRY (Don't Repeat Yourself), we decided to mirror the nesting of our object graph by using objects and attribute mapping to generate the schema.xml for Solr. We populated the data by calling SQL stored procedures using NHibernate mappings. Because we used the same objects for populating as we did for indexing, we had to keep the associated entity IDs on the objects.
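The SolrField attribute itself never appears in this post, but based on the usages in these examples, a minimal sketch of what it might look like (the option names are inferred from the code above; the real implementation may differ):

using System;

// Sketch of the mapping attribute implied by the examples in this post.
// Each flag drives schema.xml generation; names are inferred, not actual.
[AttributeUsage(AttributeTargets.Property)]
public class SolrFieldAttribute : Attribute
{
 public bool Stored { get; set; } // Store the raw value in the index
 public bool Indexed { get; set; } // Make the field searchable
 public bool IsKey { get; set; } // Marks the Solr uniqueKey field
 public bool LowercaseCopy { get; set; } // Emit a lowercased copy field for sorting
 public bool TokenizedCopy { get; set; } // Emit a tokenized copy field for full-text matching
 public string Prefix { get; set; } // Flatten a nested object under "prefix."
}

With the attribute in place, the nested object graph looked like this: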

public class Subscription
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual DateTime ExpirationDate { get; set; }

 public virtual Guid UserId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "user")]
 public virtual User User { get; set; }
}

public class SectionEnrollment
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int EnrollmentScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid SectionId { get; set; } // Required for populate stored procedure

 public virtual Guid SubscriptionId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "subscription")]
 public virtual Subscription Subscription { get; set; }
}

public class AssignmentStatus
{
 [SolrField(Stored = true, Indexed = true, IsKey = true)]
 public virtual Guid Id { get; set; }

 [SolrField(Stored = true, Indexed = true)]
 public virtual int StatusScore { get; set; } // Cannot use Score, as that is used by Solr

 public virtual Guid AssignmentId { get; set; } // Required for populate stored procedure

 public virtual Guid EnrollmentId { get; set; } // Required for populate stored procedure

 [SolrField(Prefix = "enrollment")]
 public virtual SectionEnrollment Enrollment { get; set; }
}
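As mentioned above, we populated these objects by calling SQL stored procedures through NHibernate mappings. A rough sketch of what loading a batch might look like (the named query "PopulateAssignmentStatuses" and its lastRun parameter are illustrative names, not our actual mapping):

using System;
using System.Collections.Generic;
using NHibernate;

public static class SolrPopulation
{
 // Sketch: load the report objects through an NHibernate named query that
 // wraps a stored procedure. Query and parameter names are hypothetical.
 public static IList<AssignmentStatus> LoadChangedStatuses(ISession session, DateTime lastRunUtc)
 {
  return session
   .GetNamedQuery("PopulateAssignmentStatuses")
   .SetDateTime("lastRun", lastRunUtc)
   .List<AssignmentStatus>();
 }
}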

The nested mappings resulted in far less code and achieved the same effect by adding "." separators to the schema.xml field names. For example, we used "enrollment.subscription.user.lastname" to signify the user's last name in the assignment status report. Because of this break from the JSON structure, we had to write our own parser for the results that Solr returned. We achieved this by tweaking the JSON parser we already had in place to accommodate "." separators rather than curly braces.
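To give an idea of how the dotted field names fall out of the attribute mappings, here is a sketch of a reflection-based generator (an illustration of the approach, not our production schema.xml generator):

using System;
using System.Collections.Generic;
using System.Linq;

public static class SolrSchemaFields
{
 // Recursively walk the attributed properties and emit dotted Solr field
 // names such as "enrollment.subscription.user.lastname".
 public static IEnumerable<string> GetFieldNames(Type type, string prefix = "")
 {
  foreach (var property in type.GetProperties())
  {
   var attr = property.GetCustomAttributes(typeof(SolrFieldAttribute), true)
    .Cast<SolrFieldAttribute>()
    .FirstOrDefault();
   if (attr == null)
    continue; // Populate-only properties (no attribute) are not indexed

   if (!string.IsNullOrEmpty(attr.Prefix))
   {
    // Nested object: recurse, extending the "." separated prefix
    foreach (var name in GetFieldNames(property.PropertyType, prefix + attr.Prefix + "."))
     yield return name;
   }
   else
   {
    yield return prefix + property.Name.ToLowerInvariant();
   }
  }
 }
}

Calling GetFieldNames(typeof(AssignmentStatus)) yields "id", "statusscore", then "enrollment.id", "enrollment.enrollmentscore" and so on, down to "enrollment.subscription.user.lastname".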

With our object graph finalized and the Solr implementation in place, we began to address the nested update locking issue we discussed in Full-Text Search - Part 1. We solved this problem in the new system by adding SQL triggers and an update queue. When an entity was inserted, updated or deleted, the trigger inserted an entry into that entity's queue table. Each entity had a separate worker process that processed its queue table and queued up related entities into their own entity-specific queue tables. This took the work out of the user's HTTP request and put it into a background process that could take all the time it required.

To lessen the user impact even more, the trigger performed a straight insert into the queue table without checking whether an entry already existed for that entity. This kept the trigger cheap for the user but meant that Solr would be hammered with duplicate data. To avoid the unnecessary calls to Solr, we used a DISTINCT clause in our SQL query to return the top X distinct entities and recorded a time stamp before processing began. After sending the commands to Solr to update or delete each entity, the worker then deleted any entries in the queue table with the same entity ID that were inserted before the time stamp.
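In outline, one pass of a queue worker looked something like this. This sketch uses Dapper for brevity, and the table and column names (UserQueue, EntityId, InsertedAt) are made up; the real workers are more involved:

using System;
using System.Data;
using System.Linq;
using Dapper;

public class QueueWorker
{
 // One pass: grab a batch of distinct entity IDs, note the time, push the
 // changes to Solr, then delete only the queue rows inserted before we
 // started so that newer changes are picked up on the next pass.
 public void ProcessBatch(IDbConnection connection, int batchSize)
 {
  var batchStart = DateTime.UtcNow;

  // Duplicates are collapsed here instead of in the trigger, which keeps
  // the trigger (and the user's HTTP request) as cheap as possible.
  var entityIds = connection.Query<Guid>(
   "SELECT DISTINCT TOP (@BatchSize) EntityId FROM UserQueue",
   new { BatchSize = batchSize }).ToList();

  foreach (var id in entityIds)
   UpdateOrDeleteInSolr(id); // Re-index or remove the entity in Solr

  connection.Execute(
   "DELETE FROM UserQueue WHERE EntityId IN @Ids AND InsertedAt < @BatchStart",
   new { Ids = entityIds, BatchStart = batchStart });
 }

 private void UpdateOrDeleteInSolr(Guid id) { /* send the add/delete command to Solr */ }
}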

Solr full-text indexing, coupled with a robust change-tracking queue and an easily implemented attribute mapping system, provided us with a solid reporting backend that could serve all our reporting requirements. We still had to add an interface on top of it, but most of the heavy lifting was done. Full-text search was implemented successfully!


Monday, August 12, 2013

Multiple Session Factories and the Second Level Cache

In a previous post, we discussed our approach of delaying the delete operation so that the user does not have to pay the price of waiting for it to finish. Instead, we set the IsDeleted flag to true and queue up a deletion task. This has worked well for us, although we have run into a few issues. Let's look at how multiple session factories interact with the second level cache.

Before we start, let's have a quick look at the NHibernate caching system. NHibernate uses the following caches:
  • Entity Cache
  • Query Cache
  • Collections Cache
  • Timestamp Cache
Normally, NHibernate will clear out the proper caches on its own based on which entities are being inserted, updated or deleted; the issue we ran into only appears once multiple session factories are involved. Let's look at this query.

// Queries the DB and inserts the results into the cache
var users = session.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();
// Pulls the result from the cache
users = session.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();

When NHibernate receives the result from the database, it stores the entities in the entity cache and the set of returned IDs in the query cache. When you perform the query again, it pulls the list of IDs from the query cache and then hydrates each entity from the entity cache.

Now suppose we delete a user in between performing both of these queries, or perhaps create a new one.

// Queries the DB and inserts the results into the cache
var users = session.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();
// Marks the cached query and entities as stale
session.Delete(john);
// Queries the DB again
users = session.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();
 
NHibernate takes notice and does not pull the second query from the query cache; instead, it goes back to the database for the latest information. In this way, NHibernate does a rather good job of taking care of the cache on its own. For a bit more information, have a look at this post by Ayende.

Now suppose instead that we create the two sessions (in the example above) from different session factories with identical configurations. The second level cache will be shared between them and still be used. But if the delete is performed in between, the second query will still hit the cache:

// Queries the DB and inserts the results into the cache
var users = session1.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();
// Marks the query and entities as stale in factory one's caches only
session1.Delete(john);
// Does not notice that session1 marked it as stale; pulls stale results from the cache
users = session2.QueryOver<User>().Where(u => u.FirstName == "John").Cacheable().List();

It would seem that sharing the timestamp cache should take care of this; perhaps the timestamp cache is not shared between the factories.

As it turns out, the cache is not designed to be shared between session factories at all. Normally, the chance of a key collision is low because the entity GUID is part of the cache key. But since we create multiple session factories to access the same database (or if you use an incrementing int as the key), key collisions are possible. Most of the time, you can avoid them by setting a region prefix, as shown in the blog post or the bug report.
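Setting a region prefix is just a configuration property. A minimal sketch, assuming the rest of the factory configuration is already in place (the prefix value here is made up):

using NHibernate.Cfg;

// Give each session factory its own cache region prefix so that their
// cache keys cannot collide. "delete-visible" is an illustrative value.
var configuration = new Configuration();
configuration.SetProperty(Environment.CacheRegionPrefix, "delete-visible");
var sessionFactory = configuration.BuildSessionFactory();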

Where does this leave us? Because our DeleteVisibleSessionFactory is only used to access entities that are about to be deleted, we decided that caching those entities is pointless and disabled caching on that factory entirely. This prevents it from retrieving any stale data. The last remaining issue is that an entity deleted in the DeleteVisibleSession will not be removed from the second level entity cache, so we now clear the entity cache manually after any delete in the event listeners.

NHibernateHelper.EvictEntity(@event.Entity as ModelBase2);
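For context, that call lives in a delete event listener along these lines. This is a simplified sketch: NHibernateHelper and ModelBase2 are our own types, and the stand-in below goes straight to ISessionFactory.EvictEntity instead:

using NHibernate.Event;

public class EvictOnDeleteListener : IPostDeleteEventListener
{
 public void OnPostDelete(PostDeleteEvent @event)
 {
  // Remove the deleted instance from the second level entity cache so
  // that no session factory can hydrate the deleted row from cache.
  var entityName = @event.Persister.EntityName;
  @event.Session.SessionFactory.EvictEntity(entityName, @event.Id);
 }
}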

Due to the granular nature of our query caches, we decided to manage them on a per-case basis. They often contain the ID of a parent object and need to be cleared individually. This gives us the best compromise between complexity and performance: the entity cache is managed properly by NHibernate, and the query cache is our responsibility.
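Clearing an individual query cache region looks something like this (the region name is illustrative, not one of ours):

using NHibernate;

public static class QueryCacheMaintenance
{
 // Evict a single query cache region after a change that would make its
 // cached ID lists stale. EvictQueries() with no arguments clears the
 // default region instead.
 public static void InvalidateUsersByParent(ISessionFactory sessionFactory)
 {
  sessionFactory.EvictQueries("users-by-parent");
 }
}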

