Azure Library for Lucene.Net (Full Text Indexing for Azure)

This project allows you to create a search index on Windows Azure by using Lucene.NET, with Windows Azure Blob Storage serving as the persistent store for the index.
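A minimal usage sketch is shown below, assuming the AzureDirectory, Lucene.Net 3.0, and Microsoft.WindowsAzure.Storage packages are referenced; the "indexes" catalog name and the "StorageConnectionString" app setting are placeholders, and the two-argument AzureDirectory constructor (account plus catalog name) follows the usage shown in the questions below:

  // Open a blob-backed Lucene directory and index one document.
  var cloudAccount = Microsoft.WindowsAzure.Storage.CloudStorageAccount.Parse(
      System.Configuration.ConfigurationManager.AppSettings["StorageConnectionString"]);
  var azureDirectory = new AzureDirectory(cloudAccount, "indexes");

  using (var writer = new IndexWriter(
      azureDirectory,
      new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
      IndexWriter.MaxFieldLength.UNLIMITED))
  {
      var doc = new Document();
      doc.Add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.Add(new Field("body", "full text search on azure",
          Field.Store.YES, Field.Index.ANALYZED));
      writer.AddDocument(doc);
      writer.Commit();
  }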

  • List all blobs by prefix instead of container directly
    4 Posts | Last post Mon 12:20 PM
    • L.H
The AzureDirectory lists all blobs directly under the container. I suggest making a 'prefix' parameter configurable in the ListAll() function so that we can list blobs under one specific path (like a folder in the Windows file system).
Isn't that the function a container provides?  Simply create a new container.
    • L.H
Hi Thermous:
  I think you misunderstand what I mean. Normally I have many folders for storing indexes; for example, "product indexes" and "news indexes" under a root folder "indexes". From your code, it seems the library only supports storing all indexes directly under the container "indexes". What can I do if I want to store them under "indexes/product-indexes"? It would be better to have a "prefix" parameter in the ListAll() function, something like this._blobContainer.ListBlobs("product-indexes", false, BlobListingDetails.None, null, null), so that I get only the indexes that are under the container "indexes" and prefixed with "product-indexes" (see the sketch below).
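      A minimal sketch of such a prefix-aware listing, assuming an AzureDirectory with a _blobContainer field and the Microsoft.WindowsAzure.Storage SDK; the method signature is illustrative, not part of the library, and it needs System.Linq and Microsoft.WindowsAzure.Storage.Blob:

        // Hypothetical prefix-aware ListAll: returns only the blob names
        // under the given virtual folder, with the prefix stripped off.
        public string[] ListAll(string prefix)
        {
            return _blobContainer
                .ListBlobs(prefix, useFlatBlobListing: true)
                .OfType<CloudBlockBlob>()
                .Select(blob => blob.Name.Substring(prefix.Length).TrimStart('/'))
                .ToArray();
        }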
    • L.H
And it would be better if you could upgrade the version of "Microsoft.WindowsAzure.Storage.dll". Many thanks.
  • Hitting CorruptIndexException when hosted on azure
1 Post | Last post February 24, 2014
Have a web role which accesses the Lucene index from blob storage for search. This works fine when running in the emulator, but when published to the web role we get Lucene.Net.Index.CorruptIndexException: Unknown format version:-67108865 expected -4 or higher.
Stack trace:

[CorruptIndexException: Unknown format version:-67108865 expected -4 or higher]
         Lucene.Net.Index.FindSegmentsFile.Run(IndexCommit commit) +1612
         Lucene.Net.Search.IndexSearcher..ctor(Directory path, Boolean readOnly) +74
         LuceneSearchwebRole.Search.LuceneSearch._search(String searchQuery, String searchField, AzureDirectory azureDirectory) +205
         LuceneSearchwebRole._Default.BtnSearch_Click(Object sender, EventArgs e) +512
         System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument) +154
         System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3707
       
      
  • Lucene in azure
    2 Posts | Last post November 25, 2013
I am looking at creating a Lucene.Net Azure solution that involves a front end that does the searching and uploads files to blob storage, and then enqueues a reference to the file for a worker role (the indexing component) to dequeue. I was thinking I would install the Microsoft Office 2010 Filter Pack and Adobe PDF IFilter as startup tasks (assuming they have silent installs) during the deployment of the worker role. To index a file, the role would first locate the appropriate native .dll that contains the IFilter based on the file extension and then load it. From there it would call GetChunk on the IFilter to process the document. Is it possible to use an IFilter without having to write the file locally (i.e. from a stream), since the files will already be in blob storage? After the string is built up, the role would then use AzureDirectory to index the document.
      
      Is there a better way to do any of this?
Most IFilters are implemented with IPersistFile, which loads the given input file using a UNC path as the parameter, whereas from a blob we only get a stream or an HTTP URI to access the file.
A few IFilters do implement IPersistStream; with those you can copy the blob stream into unmanaged memory and pass an IStream to load and parse the file:
      
      
        byte[] streamArray = ReadToEnd(blobstream);

        // Copy the managed buffer into unmanaged memory.
        IntPtr nativePtr = Marshal.AllocHGlobal(streamArray.Length);
        Marshal.Copy(streamArray, 0, nativePtr, streamArray.Length);

        // Create a COM stream over the unmanaged memory; passing true
        // transfers ownership, so the memory is freed with the stream.
        System.Runtime.InteropServices.ComTypes.IStream comStream;
        NativeMethods.CreateStreamOnHGlobal(nativePtr, true, out comStream);

        // Load the contents into the IFilter via its IPersistStream
        // interface. Use a safe cast: not every IFilter implements it,
        // and a direct cast would throw before the null check could run.
        var persistStream = filter as IPersistStream;
        if (persistStream != null)
        {
            persistStream.Load(comStream);

            if (filter.Init(iflags, 0, IntPtr.Zero, out flags) == IFilterReturnCode.S_OK)
                return filter;
        }
      
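      The ReadToEnd helper referenced above is not shown in the post; a minimal sketch of what it presumably does (drain the blob stream into a byte array) would be:

        // Hypothetical helper assumed by the snippet above.
        private static byte[] ReadToEnd(System.IO.Stream stream)
        {
            using (var ms = new System.IO.MemoryStream())
            {
                stream.CopyTo(ms);   // drain the source stream
                return ms.ToArray();
            }
        }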
  • segments.gen file not found exception is occurring when indexing the first document
1 Post | Last post November 25, 2013
Hi. When AzureDirectory tries to OpenInput for the first time, say when indexing the first document, it looks for the segments.gen file in blob storage as well as in the cache directory. The file is not found, and I receive a FileNotFound exception.
But the indexing process is successful, and I am able to search that particular document.
How can I handle this?
  • using the azure drive in worker role for local cache
    2 Posts | Last post November 01, 2013
Hi, is it possible to point the Lucene Azure library to an Azure virtual drive in a worker or web role?

We have two issues with the local cache drive:

It is slow to initialize the local cache on the first call; if I understand correctly, it needs to download all the index data from blob storage.

When the index is huge (>200 GB), we don't have enough space locally.

If we can use the Azure drive, can we reuse the index data after the role is rebooted? That way we don't have to download everything, and we get a bigger drive.
    • You will need to use a larger VM so that you have more local disk space.  
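      That said, the local cache is just a Lucene Directory, so it can in principle be pointed at any mounted path, including an Azure Drive mount point. A minimal sketch, assuming the drive is already mounted and drivePath holds its mount path (whether the cached index survives a reboot then depends on the drive, not on AzureDirectory):

        // Point AzureDirectory's local cache at a mounted drive path
        // instead of the default local-resource temp folder.
        var azureDirectory = new AzureDirectory(
            cloudAccount,
            "indexes",
            new Lucene.Net.Store.SimpleFSDirectory(
                new System.IO.DirectoryInfo(drivePath)));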
  • Version 1.0.5.1 Bug with Lucene.NET 2.9.4.1
    2 Posts | Last post August 08, 2013
    • I believe we found several bugs in the AzureIndexInput class whereby files from blob storage are always downloaded even if newer files exist in the cache. The changed lines are:
      
public AzureIndexInput(AzureDirectory azuredirectory, CloudBlob blob)
{
  ...
  long blobLength;
  if (!long.TryParse(blob.Metadata["CachedLength"], out blobLength))
  {
    // Fall back to the actual blob length. (The existing code
    // sets this to 0 if "CachedLength" doesn't exist.)
    blobLength = blob.Properties.Length;
  }
  ...
  // Serious bug here: CacheDirectory (FSDirectory).FileModified
  // returns the UTC file date as the number of milliseconds since
  // 1970-01-01 00:00:00, which is not the DateTime number of ticks.
  DateTime cachedLastModifiedUTC = new DateTime(1970, 1, 1)
    .AddMilliseconds(CacheDirectory.FileModified(fileName));
  ...
}
      
      Sorry in advance for the formatting. These issues were discovered when upgrading from AzureDirectory 1.0 and Lucene.NET 2.3. Our existing index files are over 600 MB and are always being downloaded.
      
      Can you create a new version 1.0.5.x for users who are still using Lucene.NET 2.x and StorageClient 1.x?
      
      Thanks in advance for your help.
      
      A.
      
Further to the above, I see the suggested file-time handling will not work when the metadata has been written by AzureIndexOutput, which uses the value provided by the FSDirectory.FileModified method. The following should work correctly whether the metadata exists or not:
      
// Standardize the handling to the FSDirectory convention
// (milliseconds since 1970-01-01 00:00:00 UTC).
long cachedLastModifiedUTC = CacheDirectory.FileModified(fileName);
long blobLastModifiedUTC;
if (!long.TryParse(blob.Metadata["CachedLastModified"], out blobLastModifiedUTC))
{
  // The FSDirectory way.
  blobLastModifiedUTC = (long)blob.Properties.LastModifiedUtc
    .Subtract(new DateTime(1970, 1, 1, 0, 0, 0))
    .TotalMilliseconds;
}
TimeSpan blobDiff = TimeSpan.FromMilliseconds(
  blobLastModifiedUTC - cachedLastModifiedUTC);
      
Also, the AzureIndexOutput constructor uses "_name" for the Mutex before "_name" is initialized. It should do the same as the AzureIndexInput constructor.
      
      Best regards,
      
      A.
  • How to create azure indexing in local storage.
    2 Posts | Last post July 10, 2013
    • Hi,
      
I am able to index data to a live Azure account but not locally. Please let me know if this is possible and how I can do so.
      
      Thanks
      
    • I don't know what you are asking... dev storage?
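      If the question is about the storage emulator (dev storage), a minimal sketch would be to hand AzureDirectory the well-known development storage account; this assumes the emulator is running, and the "indexes" catalog name is a placeholder:

        // Use the local storage emulator instead of a live account.
        var azureDirectory = new AzureDirectory(
            CloudStorageAccount.DevelopmentStorageAccount,
            "indexes");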
• Avoid having to fetch segments every time during development?
    2 Posts | Last post June 19, 2013
    • "When Lucene asks to for a read stream for a segment (remember segments never change after being closed) AzureDirectory looks to see if it is in the local cache Directory, and if it is, simply returns the local cache stream for that segment. Otherwise it fetches the segment from blobstorage, stores it in the local cache Directory and then returns the local cache steram for that segment."
      
      So my issue is everytime I rebuild and run the project local it needs to fetch all the segments needed for reading everytime which is time consuming in trying to develop the application.  What is a work around this issue?
If you mean you are running in the Azure emulator, then just override the cache location with a stable temp folder... something like:
 if (RoleEnvironment.IsEmulated)
     azureDirectory = new AzureDirectory(cloudAccount, "indexes",
         new SimpleFSDirectory(new DirectoryInfo(tempFolder)));
 else
     azureDirectory = new AzureDirectory(cloudAccount, "indexes");

By default, the cache folder used is the approved local storage temp folder for the Azure emulator, and every "deployment" in the emulator creates a new local folder, which is why you have to download the index each time.
  • Corrupted index
    2 Posts | Last post April 18, 2013
I built an index using a WinForms app on my Windows 7 Home Edition machine. When I use an Azure service to search from my local box, it works fine, but when it is deployed to Azure, I am getting:
      
       [CorruptIndexException: doc counts differ for segment _1: fieldsReader shows 220 but segmentInfo shows 540]
         Lucene.Net.Index.FindSegmentsFile.Run() +1614
         Lucene.Net.Index.DirectoryIndexReader.Open(Directory directory, Boolean closeDirectory, IndexDeletionPolicy deletionPolicy) +68
         Maarg.EDIActive.WebRole.Controllers.SearchController.Index(String id, String
      
      Any idea what I did wrong?
      
      Thanks
      Jianguo
Sounds like it didn't successfully commit the blobs to storage? I haven't seen this one.
  • "read past EOF" exception
1 Post | Last post February 28, 2013
    • Hi. I've got a similar problem to what others are saying, but the responses just say that EOFs are part of normal operation, yet that's not what we're seeing...
      
We have tried to integrate Lucene.NET and AzureDirectory into our C# project on Azure. AzureDirectory and IndexSearcher are singletons, and on "Azure Staging" it works great. But when we publish it to "Azure Production", we get a "read past EOF" exception. The AzureDirectory cache only contains 3 of the 50 files, so all subsequent searches fail.
      
      
      private static Lucene.Net.Util.Version LuceneVersion = Lucene.Net.Util.Version.LUCENE_30;
      private static Microsoft.WindowsAzure.Storage.CloudStorageAccount cloudAccount = null;
      private static AzureDirectory azureDirectory = null;
      private static Lucene.Net.Search.IndexSearcher searcher = null;
      private static StandardAnalyzer analyzer = null;
      
cloudAccount = new Microsoft.WindowsAzure.Storage.CloudStorageAccount(
    new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
        System.Configuration.ConfigurationManager.AppSettings["LuceneAccount"],
        System.Configuration.ConfigurationManager.AppSettings["LuceneKey"]),
    true);
      
      string filePath = RoleEnvironment.GetLocalResource("LuceneCache").RootPath;
      azureDirectory = new AzureDirectory(cloudAccount, "mydbitems", new Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(filePath)));
      
      searcher = new IndexSearcher(azureDirectory, true);
      analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(LuceneVersion);
      
      
      Thanks for any ideas.
      -Jerry-