Azure Library for Lucene.Net (Full Text Indexing for Azure)

This project allows you to create a search index on Windows Azure by using Lucene.NET. Indexes are stored in Windows Azure Blob Storage as the persistent storage.

3/17/2012


  • Bugs/Slow or is it just me?
    4 Posts | Last Post June 17, 2011
    • Hi Thermous, thanks for making this library freely available. I am using it and finding it slow (so far I have only got as far as adding items to the index). I'm also only working on localdev at the moment. The code I have for adding items is below:
      
              /// <summary>Adds to the search index</summary>
              public void AddToIndex(IEnumerable<Message> messages)
              {
                  // create the directory if not exists
                  var indexWriter = new IndexWriter(azureDirectory, Analyzer, !folderExists, IndexWriter.MaxFieldLength.UNLIMITED);  // Slow: takes around 2 seconds on each call
                  indexWriter.SetRAMBufferSizeMB(20.0);
                  indexWriter.SetUseCompoundFile(false);
                  indexWriter.SetMaxMergeDocs(10000);
                  indexWriter.SetMergeFactor(100);
      
                  foreach (var item in messages)
                  {
                      var newDoc = CreateLucentDoc(item.Text, item.Author, item.MessageId, item.ParentMessageId, item.DateOfMessage, item.MessageType, item.AccountId);
                      indexWriter.AddDocument(newDoc);
                  }
      
                  // cleaning up
                  //indexWriter.Commit();
                  indexWriter.Close();    // Slow: takes around 2 seconds on each call
              }
      
      
      I've read in your documentation that AzureDirectory is supposed to have smarts in it (it compresses the index and only uploads what is needed), but I am finding it slow at the points marked above.
      
      I also find that OpenInput throws a FileNotFoundException when I call it in parallel.
      
      And quite a few times, DeleteFile has thrown an exception saying that it cannot delete the file.
      
      What is the best way to use the library? Should I put the items to be indexed in a queue and load the index serially to avoid conflicts?
      
      Advice please. Many thanks
    • Ughh. The formatting of this forum leaves a lot to be desired. No CR support? Seriously? 
      Anyhow, 
      #1) SetRAMBufferSizeMB() has nothing to do with AzureDirectory. I would look at the Lucene documentation.
      #2) You should only have one instance of AzureDirectory, and one instance of the IndexWriter().  There is no need to close the IndexWriter, just keep it around.  You can call Commit() periodically if you want on it and just close on system shutdown.  When you create an IndexSearcher on the same node you can get the AzureDirectory instance from the writer via indexWriter.GetDirectory().
      #3) Compression requires you to set #define COMPRESSBLOBS.
      #4) If you don't like the perf of AzureDirectory you can always just use FSDirectory and periodically create a Snapshot(), and push all of the files up to blob storage. This is a perfectly reasonable way to do it and you don't need AzureDirectory at all. 
      
      
      
    • Thanks Thermous. I was putting it in a Parallel.For and that's when it really broke apart. I've since changed it to insert serially and it is working much better. I have an index for each user; each user gets their own Lucene container to keep it manageable. I won't implement the FSDirectory local store for now, as there is a lot more functionality I still need to write. That'll be on the to-do list!
    • In theory you should be able to use a single IndexWriter/AzureDirectory from multiple threads.  You should NOT create multiple AzureDirectory/IndexWriters. I have not tried it, so your mileage may vary.
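      The queue idea raised in this thread can be sketched roughly as follows. This is only an illustration, not part of the library: SerialIndexQueue is a hypothetical helper, and the addToIndex delegate stands in for whatever call adds a document to the one shared IndexWriter. Any number of threads may enqueue, but only a single consumer thread ever touches the index.

      ```csharp
      using System;
      using System.Collections.Concurrent;
      using System.Collections.Generic;
      using System.Threading.Tasks;

      // Hypothetical helper: funnels work from many producer threads to ONE consumer,
      // so the index is only ever touched serially.
      public sealed class SerialIndexQueue<T> : IDisposable
      {
          private readonly BlockingCollection<T> _pending = new BlockingCollection<T>();
          private readonly Task _consumer;

          // addToIndex stands in for a call such as indexWriter.AddDocument(...).
          public SerialIndexQueue(Action<T> addToIndex)
          {
              _consumer = Task.Factory.StartNew(() =>
              {
                  // GetConsumingEnumerable blocks until items arrive and ends
                  // once CompleteAdding has been called and the queue is drained.
                  foreach (T item in _pending.GetConsumingEnumerable())
                  {
                      addToIndex(item);
                  }
              }, TaskCreationOptions.LongRunning);
          }

          // Safe to call from any thread, including from inside Parallel.For.
          public void Enqueue(T item)
          {
              _pending.Add(item);
          }

          // Stops accepting items and waits until everything queued has been indexed.
          public void Dispose()
          {
              _pending.CompleteAdding();
              _consumer.Wait();
              _pending.Dispose();
          }
      }
      ```

      Disposing the queue on shutdown drains it first, which would be a natural place to commit and close the writer as well.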
  • Cache and Azure Instances
    2 Posts | Last Post June 15, 2011
    • How does the cache work across Azure instances? I mean, how does the cache keep the data in sync across all Azure web instances? When the master directory changes, how does the change propagate to all the other instances?
      Cheers
    • AzureDirectory implements Lucene's abstract Directory interface, which represents the index as a set of file streams.
      The way Lucene uses those is that it always appends, and closes the file when it flushes.  At that point AzureDirectory zips up the new file and uploads it to blob storage. The other thing that Lucene does is use a segments.gen file to control the list of files which make up the index. It always updates that file after it finishes committing files, which means the new state of the index is not visible until the updated segments.gen file has been uploaded to blob storage.
      
      Finally, the directory has a CreateLock() implementation which works by creating a write.lock file in blob storage.  This ensures that only one node can update the index at a time, but any node can pick up and use the new segments whenever a new segments.gen file is uploaded to blob storage.
      
      You might think this is pretty slow, but in fact the segment files are cached locally (except for the lock file and the segments.gen file), which means that the bulk of the data is already in the right place most of the time.
      Since it is compressed, the time to upload/download is kept to a minimum.
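      As a rough illustration of the ordering described above, here is a simplified in-memory model. It is not AzureDirectory's actual code: BlobStore and CachingReader are hypothetical names, and the manifest array merely plays the role of the segment list file. The model relies on segment files never changing once written, which is how Lucene treats them.

      ```csharp
      using System;
      using System.Collections.Generic;
      using System.Linq;

      // Simplified in-memory stand-in for a blob container (hypothetical).
      public sealed class BlobStore
      {
          private readonly Dictionary<string, byte[]> _blobs = new Dictionary<string, byte[]>();
          private string[] _manifest = new string[0];   // plays the role of the segment list file

          // Writer side: upload the segment files first and publish the manifest last,
          // so readers never see a manifest that names a missing file.
          public void Publish(IDictionary<string, byte[]> newSegments)
          {
              foreach (var segment in newSegments)
              {
                  _blobs[segment.Key] = segment.Value;
              }
              _manifest = _manifest.Union(newSegments.Keys).ToArray();
          }

          public string[] ReadManifest() { return _manifest; }
          public byte[] Download(string name) { return _blobs[name]; }
      }

      public sealed class CachingReader
      {
          private readonly BlobStore _store;
          private readonly Dictionary<string, byte[]> _localCache = new Dictionary<string, byte[]>();

          public CachingReader(BlobStore store) { _store = store; }

          // Reader side: always re-read the small manifest, but download only the
          // segment files that are not in the local cache yet.
          // Returns the names of the files fetched on this call.
          public List<string> Sync()
          {
              var fetched = new List<string>();
              foreach (string name in _store.ReadManifest())
              {
                  if (_localCache.ContainsKey(name)) continue;
                  _localCache[name] = _store.Download(name);
                  fetched.Add(name);
              }
              return fetched;
          }
      }
      ```

      In this model a second Sync() after a new Publish() fetches only the newly added segment, which is the reason the local cache keeps transfers small.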
      
  • Regarding Scaling and Index Update
    2 Posts | Last Post June 10, 2011
    • I have a use case wherein any user can upload a document via the upload service hosted in the web role. These documents are stored in blob storage, and the metadata is sent to the index writer queue, which is consumed by a worker role on the other side for indexing. The users who upload documents would like to see them in the search results, but the IndexSearcher object is old. With multiple users being served by the web role, what is the best strategy for refreshing the IndexSearcher?
      
      Also please throw some light on possibilities of index sharding using azuredirectory (lucene.net) to make it a more scalable solution.
      
      Thanks
      Amar
    • Since you have split the writer off from the searcher you need to coordinate them. The new indexed data is not really available to the searcher until the segments are  flushed from the writer, up to blob storage, and the searcher role then creates a new searcher that gets the new segments.
      
      There are several ways you can deal with this:
      1. Change the writer to call Commit() after each write (this is relatively expensive and only appropriate if writes are a rare event).  This will cause the segments to be uploaded to blob storage.  Then the writer needs to notify the searcher to create a new Searcher object; when that happens it will notice the new segments and download them.  Again, this approach is fine if writes are fairly rare.
      2. Another approach is to shard the index, which works just fine with azuredirectory.  Each catalog is a separate shard, and if you can collocate the writer with searcher then there is less coordination from the standpoint of updates (instead of commit(), you just create a new Searcher() and the data will be there.) To support sharding you simply have multiple catalogs with multiple writers and a searcher per catalog.  You then use Lucene's MultiSearcher() object to do federated queries across all of the catalogs.  You can even create a MultiSearcher() that talks to multiple RemoteSearchers() which then allows you to have distributed search against sharded catalogs.
      
      Ultimately, Lucene is supremely flexible in the topology that you can put together.  I strongly recommend the book "Lucene In Action" by Erik Hatcher, as it goes over all of these strategies.
      
      The way that AzureDirectory plugs into this is that, given you have a catalog, you can have it automatically backed up by blob storage.  But there is nothing that requires you to use AzureDirectory to use Lucene on Azure.  The best topology depends on your needs from a query load/write load/persistence/privacy perspective.
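      
      For the "notify the searcher to create a new Searcher" step in the first approach, one possible shape is sketched below. SearcherHolder is a hypothetical class, T stands in for IndexSearcher, and the openSearcher delegate is assumed to open a searcher over the current state of the directory. How the notification itself arrives (a queue message, a timer, and so on) is left out.

      ```csharp
      using System;
      using System.Threading;

      // Hypothetical holder: request threads read Current, while a notification
      // handler calls Refresh() after the writer has committed.
      public sealed class SearcherHolder<T> where T : class
      {
          private readonly Func<T> _openSearcher;
          private T _current;

          public SearcherHolder(Func<T> openSearcher)
          {
              _openSearcher = openSearcher;
              _current = openSearcher();
          }

          // The searcher that new requests should use.
          public T Current
          {
              get { return Volatile.Read(ref _current); }
          }

          // Opens a new searcher and swaps it in atomically. Returns the old one so
          // the caller can close it once in-flight searches against it have finished.
          public T Refresh()
          {
              T fresh = _openSearcher();
              return Interlocked.Exchange(ref _current, fresh);
          }
      }
      ```

      The sketch deliberately does not close the old searcher itself, because requests that already hold a reference to it may still be running.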
      
      
      
      
  • Azure Connection Closed
    2 Posts | Last Post June 07, 2011
    • Hi.. While using AzureLibrary with Lucene I am running into the below exception/issue frequently and wondering if you have any idea what could be causing it.  
      
      I have found other posts that talk about connection defaults, but have no idea how the Azure Library handles those or how to configure them..  Any help would be appreciated.
      
      ************************************************************************************************
      Message: Unable to read data from the transport connection: The connection was closed.
      Data: System.Collections.ListDictionaryInternal
      TargetSite: T get_Result()
      HelpLink: NULL
      Source: Microsoft.WindowsAzure.StorageClient
      StackTrace Information Details:
      ======================================
         at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.get_Result()
         at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.ExecuteAndWait()
         at Microsoft.WindowsAzure.StorageClient.TaskImplHelper.ExecuteImpl(Func`1 impl)
         at Microsoft.WindowsAzure.StorageClient.CloudBlob.UploadFromStream(Stream source, BlobRequestOptions options)
         at Microsoft.WindowsAzure.StorageClient.CloudBlob.UploadFromStream(Stream source)
         at Lucene.Net.Store.Azure.AzureIndexOutput.Close()
         at Lucene.Net.Index.CompoundFileWriter.Close()
         at Lucene.Net.Index.DocumentsWriter.CreateCompoundFile(String segment)
         at Lucene.Net.Index.IndexWriter.DoFlushInternal(Boolean flushDocStores, Boolean flushDeletes)
         at Lucene.Net.Index.IndexWriter.DoFlush
      
      ****************************************************************************************************
    • Sounds like your connection keeps dropping while uploading blobs.  You may have to increase the connection timeouts.
  • Solr?
    2 Posts | Last Post June 07, 2011
    • Have you thought about implementing Solr in C# as well?
    • No, and to be clear, I didn't implement Lucene.NET, I implemented just a Directory class which abstracts the storage part of Lucene.NET so that it can be persisted to blob storage easily.
  • Indexing SQL Azure database with Lucene
    2 Posts | Last Post June 07, 2011
    • Hi,
      
      I'm not sure if this is the appropriate place to ask the following question, so I apologize in advance for any inconvenience. But I'd love some pointers on this.
      
      Does this Azure Library for Lucene allow you to index a table in an SQL Azure database? If it doesn't, then what would be the approach for getting data from my SQL Azure database to be indexed by Lucene.Net?
      
      Also, if the data is on an SQL Server but Lucene is on Windows Azure, what would be a way to index the data?
      
      Thanks again. Any help or advice on this is very much appreciated.
      
      Best,
      Quan.
    • You can index anything, because you are responsible for all logic of enumerating records and storing them in Lucene. 
      
      You would have to have the sync logic (either hook every write so it writes to Lucene, or do incremental crawls).  
  • Windows Phone 7
    2 Posts | Last Post June 07, 2011
    • Will this work on Windows Phone 7? If so, will it work when the phone is offline?
    • No, no.
      
      For it to work would mean distributing your azure storage keys to the client which would be a very bad idea from a security standpoint.
      
      The normal way to do it is to create a REST/WCF service endpoint to do the search, which internally uses AzureDirectory to search a Lucene.NET index.
      
      You could in theory run Lucene.NET local to the phone, but it relies on APIs which don't exist on the phone, so it would be a port.
      
  • Can I use optimize on indexes larger than our memory size?
    2 Posts | Last Post June 01, 2011
    • We will have to index huge amounts of data (a few hundred TB). The index will be in blobs, but not necessarily in the same directory. We will shard it so that an individual index writer doesn't look at more than a terabyte at most.
      We have around 16 GB of memory on the nodes in Azure. Can we call optimize on an index larger than the RAM size?
    • Yes, I don't know of any reason you shouldn't be able to. I will note that that's a very large index and it will take a very long time.
  • Indexing SQL Azure database with Lucene
    2 Posts | Last Post May 07, 2011
    • Hi,
      
      I'm not sure if this is the appropriate place to ask the following question, so I apologize in advance for any inconvenience. But I'd love some pointers on this.
      
      Does this Azure Library for Lucene allow you to index a table in an SQL Azure database? If it doesn't, then what would be the approach for getting data from my SQL Azure database to be indexed by Lucene.Net?
      
      Also, if the data is on an SQL Server but Lucene is on Windows Azure, what would be a way to index the data?
      
      Thanks again. Any help or advice on this is very much appreciated.
      
      Best,
      Quan.
    • You can index anything you want, so sure, you can use it to provide a full-text index of the rows that are going into the SQL Azure database.  You basically have to write the code yourself to get the data in there.  You do this by either:
      a. querying for new rows periodically (aka crawling), or
      b. hooking the write pipeline of the code that puts data into the database and having it send the key (or the entire row) to the indexer code to put into the index.
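      
      Option (a) could look something like the sketch below. Everything in it is an assumption made for illustration: the Row shape, the ModifiedUtc column used as a watermark, and the two delegates, where fetchRowsModifiedAfter stands in for a SQL query filtered on the watermark and indexRow stands in for building a Lucene document and adding it to the index.

      ```csharp
      using System;
      using System.Collections.Generic;
      using System.Linq;

      // Hypothetical row shape with a last-modified timestamp column.
      public sealed class Row
      {
          public int Id { get; set; }
          public string Text { get; set; }
          public DateTime ModifiedUtc { get; set; }
      }

      public sealed class IncrementalCrawler
      {
          private readonly Func<DateTime, IEnumerable<Row>> _fetchRowsModifiedAfter;
          private readonly Action<Row> _indexRow;
          private DateTime _watermark = DateTime.MinValue;

          public IncrementalCrawler(Func<DateTime, IEnumerable<Row>> fetchRowsModifiedAfter,
                                    Action<Row> indexRow)
          {
              _fetchRowsModifiedAfter = fetchRowsModifiedAfter;
              _indexRow = indexRow;
          }

          // Indexes every row changed since the previous crawl, oldest first,
          // and returns how many rows were indexed.
          public int CrawlOnce()
          {
              int count = 0;
              foreach (Row row in _fetchRowsModifiedAfter(_watermark).OrderBy(r => r.ModifiedUtc))
              {
                  _indexRow(row);
                  _watermark = row.ModifiedUtc;   // advance only after the row is indexed
                  count++;
              }
              return count;
          }
      }
      ```

      A real crawler would also need to persist the watermark between runs, handle deleted rows, and cope with several rows sharing the same timestamp; the sketch ignores all three.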
  • Sharing instance of AzureDirectory by Writer and Reader
    2 Posts | Last Post May 07, 2011
    • Is it safe (or recommended) to share the same instance of AzureDirectory between IndexSearcher and IndexWriter? I.e. in the lowest scale, when searcher and indexer run on the same role instance. If so the searcher would be always up-to-date since writer writes through the shared cache. Can you comment, please?
    • Having the searcher and indexer on the same instance is no problem, but don't share at the AzureDirectory level, share at the indexreader level. You can construct an IndexSearcher around an IndexWriter's Reader (which is using AzureDirectory). 
      
      (from memory, so might be incorrect, but the spirit is that you can do something like this)
         IndexSearcher searcher = new IndexSearcher(indexWriter.GetReader());
      
      What is great about this is that the indexsearcher then is using the in-memory indexed data, not just the data that has been flushed to the directory.
      
      