Azure Library for Lucene.Net (Full Text Indexing for Azure)

This project allows you to create a search index on Windows Azure by using Lucene.NET, with Windows Azure Blob Storage as the persistent store for the index files.



  • Best practices on how to integrate Lucene.NET (& AzureDirectory) with ASP.NET MVC projects on Azure
    2 Posts | Last post March 16, 2012
    • We have tried to integrate Lucene.NET and AzureDirectory into our ASP.NET MVC project on Azure.
      
      The search is working OK, but we've had a lot of problems with generating and maintaining our index:
      - If we instantiate AzureDirectory and IndexSearcher per request, everything is very slow
      - After we tried to keep AzureDirectory and IndexSearcher as a singleton, we got a "read past EOF" exception 
      Since we couldn't get AzureDirectory to work, we have put the index files into the "siteroot" directory of our webpage without using AzureDirectory (the index is generated directly on the WebRole).
      
      The main problem with this solution is that every few hours or days (it's random!) Azure seems to delete our index files on the WebRole. As a result, the index needs to be regenerated and some of the functionality of our site isn't available for the duration of the index generation.
      
      Do you have any solution for the "EOF exception"?
      
      Do you have any advice on the architecture, or some best practices on how to integrate Lucene.NET (& AzureDirectory) with ASP.NET MVC projects on Azure?
      
      Thanks a lot for your help,
      Alen
      
    • AzureDirectory should be a singleton.  You do not want to recreate it on every request.
      Make sure you are using Azure local storage and an FSDirectory as your local cache.
      Lucene.NET throws EOF exceptions as part of its normal execution.  Were you seeing unhandled exceptions?  If not, then it is a normal part of its operation.
      Normally you create a new IndexSearcher every time you want a fresh view of the data.  It is relatively cheap, but not free, to create a new IndexSearcher, so depending on your data you may decide to do it every request, once a minute, once an hour, or whatever.
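      For illustration, a minimal sketch of that pattern (assuming a local storage resource named "LuceneCache", a catalog named "MyIndex", and a "StorageConnectionString" setting; disposal of stale searchers is left out to keep it short):

       using System;
       using Lucene.Net.Search;
       using Lucene.Net.Store;
       using Lucene.Net.Store.Azure;
       using Microsoft.WindowsAzure;
       using Microsoft.WindowsAzure.ServiceRuntime;

       public static class SearchIndex
       {
           private static readonly object _sync = new object();
           private static AzureDirectory _directory;            // singleton, created once per process
           private static IndexSearcher _searcher;              // refreshed periodically, not per request
           private static DateTime _lastRefresh = DateTime.MinValue;

           private static AzureDirectory GetDirectory()
           {
               if (_directory == null)
               {
                   var account = CloudStorageAccount.Parse(
                       RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
                   string cachePath = RoleEnvironment.GetLocalResource("LuceneCache").RootPath;
                   // local FSDirectory in Azure local storage acts as the on-disk cache
                   _directory = new AzureDirectory(account, "MyIndex",
                       FSDirectory.Open(new System.IO.DirectoryInfo(cachePath)));
               }
               return _directory;
           }

           public static IndexSearcher GetSearcher()
           {
               lock (_sync)
               {
                   // recreate the searcher at most once a minute to get a fresh view of the index
                   if (_searcher == null || DateTime.UtcNow - _lastRefresh > TimeSpan.FromMinutes(1))
                   {
                       _searcher = new IndexSearcher(GetDirectory(), true);   // read-only
                       _lastRefresh = DateTime.UtcNow;
                   }
                   return _searcher;
               }
           }
       }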
      
      
  • Ranking data with Lucene.Net
    2 Posts | Last post March 16, 2012
    • Hi,
      
      How do we rank data with Lucene.Net? Can I influence the results by overriding the default algorithm?
      
      Thanks in advance.
      
      Mifla
    • There are copious amounts of information about using Lucene for ranking, including an excellent book which you can get on Amazon.
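      As a hedged illustration (Lucene.NET 2.9.x API; the field name, boost values, and the flat-TF Similarity below are assumptions chosen for the example, not part of this library), ranking is usually influenced through index-time boosts or a custom Similarity:

       using Lucene.Net.Documents;
       using Lucene.Net.Search;

       // ignore term frequency: one match in a document scores the same as ten
       public class FlatTfSimilarity : DefaultSimilarity
       {
           public override float Tf(float freq)
           {
               return freq > 0 ? 1.0f : 0.0f;
           }
       }

       public static class RankingExamples
       {
           // boost important fields/documents at index time
           public static Document BuildDocument(string titleText)
           {
               var doc = new Document();
               var title = new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED);
               title.SetBoost(2.0f);    // terms in the title count twice as much
               doc.Add(title);
               doc.SetBoost(1.5f);      // the whole document is somewhat more important
               return doc;
           }

           // swap in the custom scoring at search time
           public static void UseCustomScoring(IndexSearcher searcher)
           {
               searcher.SetSimilarity(new FlatTfSimilarity());
           }
       }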
  • RAMDirectory
    3 Posts | Last post February 06, 2012
    • Hi,
      
      I have a web role and a WCF service which has a RAMDirectory as cache. My problem is that when the WCF service is not hit by a request (say, for an hour), IIS just kills the service, the instance of the AzureDirectory is gone, and I have to instantiate it again. It takes 1.5 minutes to get everything from the blob to the RAMDirectory again. How can I solve this? I have tried to serialize the RAMDirectory, deserialize it, and use that instance when creating the AzureDirectory, but it does not use my serialized cache. Thank you very much!
    • You can triple cache it like this:
       AzureDirectory azureDirectory = new AzureDirectory("MyIndex", new RAMDirectory(new FSDirectory(@"c:\myindex")));
      
      This essentially uses the RAMDirectory as your cache over the local file-system cache of the remote data.  This should save you from having to fetch the segments on every start.
      
      That said, I don't know how to prevent IIS from killing the service...that's outside the scope of this forum.
      
      
    • Hi again,
      
      I use version 2.9.4 and the constructor for FSDirectory is internal; what version are you using?
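      For reference, here is a sketch of the same triple cache against Lucene.NET 2.9.4, where the FSDirectory comes from the static FSDirectory.Open factory instead of a constructor; the catalog name and local path are placeholders:

       using System.IO;
       using Lucene.Net.Store;
       using Lucene.Net.Store.Azure;
       using Microsoft.WindowsAzure;

       public static class TripleCacheExample
       {
           public static AzureDirectory Create()
           {
               // development storage here; substitute your real account
               CloudStorageAccount account = CloudStorageAccount.DevelopmentStorageAccount;

               // 2.9.4: obtain the FSDirectory through the static factory
               FSDirectory fileCache = FSDirectory.Open(new DirectoryInfo(@"c:\myindex"));

               // RAM cache over the local file cache over the blob-backed index
               return new AzureDirectory(account, "MyIndex", new RAMDirectory(fileCache));
           }
       }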
  • Increased index file count after upgrade to Lucene.NET 2.9.4
    5 Posts | Last post February 03, 2012
    • Hi,
      
      Just as "k.c.s." brought up, I have tried manually upgrading to Lucene.NET 2.9.4, and it was surprisingly easy and worked well. I am now running some tests and all works well so far.
      
      On the other hand, I have hit a serious issue (for me) that I am sure is purely Lucene.NET related and a consequence of the upgrade. I have a specific application where my readers refresh the index fairly often, so it is very important to me that the Lucene index occupies as few files as possible. With Lucene.NET 2.9.2 I managed to bring the index file count down to 3 or 4, while with the upgraded Lucene.NET 2.9.4 the entire index spans 10 files. Believe it or not, this is a severe performance hit: requesting 10 files from storage versus 3 makes a difference of a few seconds, even though the total file size is identical. These tests are from a local debug environment. I have yet to deploy to staging to test there.
      
      I am using the following to minimize file count in Lucene.NET 2.9.2:
      
      writer.SetUseCompoundFile(true);
      writer.Optimize(1, true);
      
      So, these give 3 files in the older version versus 10 in the new Lucene.NET.
      
      Is there another way I can control this any better? Ideally, I'd prefer to have the entire index in one solid file but that is probably impossible. Any ideas?
      
      Thanks!
    • I think you are doing exactly the opposite of what you need to be doing.  Refreshing the readers often is no problem, but why do you think that having the Lucene index in as few files as possible is important or even desirable?  You will end up causing the entire index to be replicated, short-circuiting the whole incremental nature of indexing. 
      
      If you want one file you should just drop AzureDirectory and copy the index around yourself, but I seriously doubt it will be as fast as turning off compound files and optimizing rarely. 
      
      Read my post on the serious perf problems that compound files and optimizations cause with AzureDirectory and distributed nodes, and then read how Lucene's incremental segment index files work. 
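      A minimal sketch of the writer settings suggested here (non-compound segments, infrequent optimization); the analyzer and tuning choices are assumptions for the example:

       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Index;
       using Lucene.Net.Store;

       public static class WriterSettings
       {
           public static IndexWriter Create(Directory directory)
           {
               var writer = new IndexWriter(directory,
                   new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
                   IndexWriter.MaxFieldLength.UNLIMITED);

               writer.SetUseCompoundFile(false);   // keep separate segment files so readers
                                                   // only re-download the segments that changed
               // call writer.Optimize() rarely (if at all): an optimize rewrites the whole
               // index and forces every reader node to re-download everything
               return writer;
           }
       }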
    • Thanks for your answer and once again, thanks for your great work.
      
      As for my striving towards a compound file, this comes purely from first-hand experience while debugging my test application. Consider an index of e.g. 5MB. This is not too large. My index will never grow insanely large; perhaps 20-30MB max. Anyway, my tests are plain and simple. I refresh the readers quite often and they need several seconds (3-5) just to pull the updated index (when not in compound mode - approx. 10 files). On the other hand, when I make the index compound (consisting of only 3 files), I get almost a 10x performance increase in terms of pulling the updated index into the web role, even with the same number of bytes.
      
      So if you can tell me how to refresh the readers without a 3-5 second delay on each refresh, I would love to ditch the compound approach. With the latter I am getting a < 500ms refresh of each reader, simply because there are fewer files. Not to mention this all happens on a fairly quick local PC with all Azure services running in the same box. I have not run any tests on Azure itself, but I am afraid to even do so with multi-second reader latency.
      
      So, to wrap up. I *know* I am doing the opposite of what you recommended (yes, I have read that part), but what I do now gives me the best reader performance, even if the writer suffers a bit because of its compound index. I think I can live with writer delays if readers are super fast.
      
      Perhaps my debug environment is giving me wrong figures so my assumptions are all wrong?
      
      I'd appreciate any further comments you may have. Thanks!
      I recommend setting the #defines FULLDEBUG and COMPRESSBLOBS.
      
      This will output to the debug window every time a segment is opened from the cache, downloaded from storage, etc.  I would turn off compound files and optimizations and see what happens.  
      
      What you should see is that only the modified segment is downloaded when it changes.  If you see all segments being downloaded every time, then you are doing something wrong. 
      
      You don't need to create a new IndexReader() every time.  All you should do is create a new IndexSearcher() over the existing IndexReader(AzureDirectory()) instance.  (In older versions of Lucene this was not the case, but I think in 2.x and later you can just keep using the single IndexReader.)
      
      Let me know what you find out, but if you turn on the debug statements you will have a clear picture of how your code is interacting with the remote blob storage.  You also should not be recreating the AzureDirectory or RAMDirectory/FSDirectory() instances either.
      
      It should be something like:
       AzureDirectory azureDirectory = new AzureDirectory("MyIndex", new RAMDirectory(new FSDirectory(@"c:\myindex")));
       IndexReader reader = IndexReader.Open(azureDirectory, true);   // read-only reader over the AzureDirectory
      
      while(true)
      {
           IndexSearcher searcher = new IndexSearcher(reader);
           // use it for X time
           // ...
      }
      
      
      
    • Looking at one of our projects, the way we use it is to just grab the reader from the IndexWriter like this:
        private void _refreshIndexer()
        {
           _searcher = new IndexSearcher(_indexWriter.GetReader());
        }
      
  • Using with Luke
    4 Posts | Last post January 28, 2012
    • Hi, is there any way to use Luke to view the index? I've downloaded the index to my local computer, but when I point Luke at it, it shows the message "No valid directory at the location, try another location."
      
      Thanks
    • I've just tried opening the indexes with Luke in the emulator's local store and some of them worked, so I think I've broken something somewhere.
    • Ah, the blobs in blob storage are normally compressed.  If you point Luke at the local cache then you can definitely inspect them.  Or you can turn off the compression.
    • Thanks, I can read all the indexes in my local store by turning off compression. 
      
      Also, the underscores in the file names got removed when I downloaded the files from blob storage with Azure Storage Explorer. Luke could read them after I renamed them.
  • segments.gen FileNotFoundException
    6 Posts | Last post January 03, 2012
    • Hi,
      
      I have been playing around with AzureDirectory a bit and it is quite cool and really fast.
      
      I am having one issue though when creating a new directory.
      
      When I call the following: IndexWriter writer = new IndexWriter(_directory, GetAnalyzer(), !indexExists); for a directory that does not yet exist, I always get a FileNotFoundException on segments.gen (AzureDirectory.cs line 260).
      
      Has anyone else experienced this, or is there something I am missing when creating a new directory?
      
      Cheers.
    • Lucene.NET uses file-not-found exceptions as part of its normal path, so this is normal.
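      A minimal sketch of the create-if-missing pattern implied by the question, assuming Lucene.NET 2.9.x (the IndexReader.IndexExists check and the StandardAnalyzer are illustrative choices, not requirements of this library):

       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Index;
       using Lucene.Net.Store;

       public static class IndexBootstrap
       {
           // create the catalog on the first run, open it afterwards
           public static IndexWriter OpenWriter(Directory azureDirectory)
           {
               bool indexExists = IndexReader.IndexExists(azureDirectory);
               return new IndexWriter(
                   azureDirectory,
                   new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
                   !indexExists,                            // create only when the catalog is new
                   IndexWriter.MaxFieldLength.UNLIMITED);
               // the first run still raises FileNotFoundException for segments.gen internally;
               // Lucene catches it, so it is not an error
           }
       }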
    • Thanks for clarifying this. I was getting the file-not-found exception as well on the first run (you can reproduce this by deleting the files from the local cache directory and then running with a new catalog name).
    • Hi
      I am getting the same exception over and over while trying to run the sample TestApp. I have modified app.config to use my storage account. The aforementioned line keeps throwing file not found on "segments.gen" for me as well. Any idea how to get past this error?
      Following is the stacktrace:
         at Lucene.Net.Store.Azure.AzureDirectory.OpenInput(String name) in C:\Users\shiveshr\Desktop\Azure Library for Lucene.Net (Full Text Indexing for Azure)\C#\AzureDirectory\AzureDirectory.cs:line 274
         at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()
      
      Could someone please help me get past this?
    • Also, when I create a blob myself (tried with both a null string and a junk string), I start getting an exception thrown at a different location: {"read past EOF"}
         at Lucene.Net.Store.BufferedIndexInput.Refill()
         at Lucene.Net.Store.BufferedIndexInput.ReadByte()
         at Lucene.Net.Store.Azure.AzureIndexInput.ReadByte() in C:\Users\shiveshr\Desktop\Azure Library for Lucene.Net (Full Text Indexing for Azure)\C#\AzureDirectory\AzureIndexInput.cs:line 194
         at Lucene.Net.Store.IndexInput.ReadInt()
         at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()
      
    • As mentioned before, Lucene.NET uses file-not-found and read-past-EOF exceptions as part of normal operation. Change your debugger to not stop on those exceptions.
  • Multiple azure web roles each having its own IndexWriter
    3 Posts | Last post December 22, 2011
    • I'm a newbie with Lucene and Azure, and trying to figure out how to use AzureDirectory.
      If I have multiple instances of my MVC app on Azure, each with its own IndexWriter (without using any queue solutions):
      1. Does AzureDirectory automatically make the IndexWriter not update(A) the index in the Azure blob if another instance is writing to it at the moment?
      2. If 1 is true, what should I do to make sure that update(A) is eventually written to the Azure blob? 
      3. If 1 is false, how do I use AzureDirectory to implement something like what is described here, http://code.msdn.microsoft.com/Azure-Library-for-83562538/sourcecode?fileId=18714&pathId=1390456562
      in the "Resolving the concurrent issue" section?
      
      thanks for your input
    • AzureDirectory implements a Lock() method which "locks" the blob storage for writing.  IndexWriter uses that lock to get exclusive access.  So yes, if you create an IndexWriter on one machine, the other machines are prevented from writing to the index. When you close the IndexWriter the data is flushed to the AzureDirectory (and hence to blob storage) and then the IndexWriter releases the lock.
      
      Long story short, multiple IndexWriters on the same catalog don't work so well.
      
      A natural solution is to have a queue with one node responsible for updating the index and N nodes with IndexSearchers/IndexReaders consuming the index.
      
      Another solution, which is a bit more involved, is to have multiple writer nodes and node affinity for incoming data (aka a sharding solution).  This usually gets complicated pretty quickly.  
      
      My experience is that Lucene can update the index at around 10-30 docs a second depending on the complexity of your schema, so having one node doing the writing is good enough for most solutions.
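      A rough sketch of that single-writer, queue-fed pattern (the queue name, message format, and field names are assumptions for the example, and error handling is omitted):

       using System.Threading;
       using Lucene.Net.Analysis.Standard;
       using Lucene.Net.Documents;
       using Lucene.Net.Index;
       using Lucene.Net.Store.Azure;
       using Microsoft.WindowsAzure;
       using Microsoft.WindowsAzure.StorageClient;

       public class IndexWorker
       {
           private readonly AzureDirectory _directory;
           private readonly CloudQueue _queue;

           public IndexWorker(CloudStorageAccount account, AzureDirectory directory)
           {
               _directory = directory;
               _queue = account.CreateCloudQueueClient().GetQueueReference("index-updates");
           }

           // the only node that ever opens an IndexWriter on this catalog
           public void Run()
           {
               while (true)
               {
                   CloudQueueMessage message = _queue.GetMessage();
                   if (message == null)
                   {
                       Thread.Sleep(5000);   // nothing to index right now
                       continue;
                   }

                   var writer = new IndexWriter(_directory,
                       new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
                       IndexWriter.MaxFieldLength.UNLIMITED);

                   var doc = new Document();
                   doc.Add(new Field("body", message.AsString, Field.Store.YES, Field.Index.ANALYZED));
                   writer.AddDocument(doc);

                   writer.Close();                 // flush segments to blob storage, release the lock
                   _queue.DeleteMessage(message);  // delete only after a successful commit
               }
           }
       }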
      
      
      
    • I realized I put the wrong link in question 3. It should be http://blogs.msdn.com/b/windows-azure-support/archive/2010/11/03/a-common-scenario-of-multi_2d00_instances-in-windows-azure-.aspx
      
      It talks about implementing a wait and merge approach:
              private void checkForMerge()
              {
                  if (_count > 10)
                  {
                      //check whether the index is locked.
                      try
                      {
                          if (locked…)
                              return;//locked ,so return and wait for next turn…
                      }
                      catch (Exception)
                      {
                      }
                     //do merging here…
                      _count = 0; //zero the counter.
                  }
              }
      
      I can utilize AzureLock on the AzureDirectory to obtain a lock from one instance. If successful, I then close the writer to flush the data to blob storage. If it's locked already, I just do a flush to write the data to the local cache and wait until the next turn to flush to the blob. Can you see anything wrong with this implementation? Thanks
  • CloudDrive?
    2 Posts | Last post November 01, 2011
    • I'm wondering why you didn't use a CloudDrive underneath Lucene in this case? In other words, what was the main reason behind creating your own file caching scheme on top of blob storage?
      
      I'm asking this question specifically because we are considering doing the same thing; CloudDrive seems to have some fairly significant performance issues when used as the underlying file store for database-style applications (i.e., applications that store a lot of data, and call ::flush() frequently, resulting in blob store transactions to update the VHD page blob).
      
      Thanks...
    • At the time I wrote this library, CloudDrive wasn't available yet.  In theory you could do it on CloudDrive, with whatever limitations CloudDrive imposes. I don't know whether the semantics of synchronization would all work correctly for multiple writers.
  • Great Performance
    2 Posts | Last post October 24, 2011
    • I just ran a test by taking this code, adding a sample app, and running it in an Azure Extra Small instance (using the instance's local storage). With 100,000 records in the index, I was getting sub-second response times. This is awesome! 
    • I still have to test the multi-user scenario, though. I also downloaded some books (text) from the Project Gutenberg site and added them to my index (each file.ReadLine() becomes a doc), and now I can see the performance degrading.
  • Caching the AzureDirectory
    2 Posts | Last post October 19, 2011
    • Hi, silly basic question. I have one worker that periodically rebuilds the index, say every hour (& saves it to blob storage). Then I have N web roles that read the index:
      
      var dir = new Lucene.Net.Store.Azure.AzureDirectory(cloudStorage, "SearchCatalog", new RAMDirectory());
      
      When they do this, they download the whole index. I only want them to download it once an hour (rather than every time I instantiate the AzureDirectory).
      
      Should I just throw the AzureDirectory object into say a System.Runtime.Caching.MemoryCache?
      
      Thanks!
    • If you are using AzureDirectory then you should only instantiate it once on the reader and once on the writer.  Every time you create a new IndexSearcher/Reader, it will automatically sync just the changes to the client.