AzureDirectory lists all blobs directly under the container. I suggest making the 'prefix' parameter configurable in the ListAll() function so that we can list blobs under one specific path (like a folder in the Windows file system).
I think you misunderstood what I mean. Normally I have many folders to store indexes; for example, I might have "product indexes" and "news indexes" under the root folder "indexes". But from your code, it seems all indexes can only be stored directly under the container "indexes". What can I do if I want to store them under "indexes/product-indexes"? It would be better to have a "prefix" in the ListAll() function, something like: this._blobContainer.ListBlobs("product-indexes", false, BlobListingDetails.None, null, null), so I only get the indexes that are under the container "indexes" and prefixed with "product-indexes".
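For illustration, a prefix-aware ListAll() might look something like the sketch below, using the CloudBlobContainer.ListBlobs overload mentioned above; the `_prefix` field and its wiring through the constructor are hypothetical additions, not part of the current AzureDirectory:

```csharp
// Hypothetical: AzureDirectory stores an optional blob-name prefix
// (e.g. "product-indexes/") supplied through its constructor.
private readonly string _prefix;

public override string[] ListAll()
{
    // Flat listing restricted to blobs whose names start with _prefix.
    return _blobContainer
        .ListBlobs(_prefix, true, BlobListingDetails.None, null, null)
        .OfType<CloudBlockBlob>()
        .Select(b => b.Name.Substring(_prefix.Length)) // strip the folder prefix
        .ToArray();
}
```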
I have a web role which accesses the Lucene index from blob storage for search. This works well when running under the emulator, but when published to the web role I get Lucene.Net.Index.CorruptIndexException: Unknown format version: -67108865 expected -4 or higher.
[CorruptIndexException: Unknown format version:-67108865 expected -4 or higher]
Lucene.Net.Index.FindSegmentsFile.Run(IndexCommit commit) +1612
Lucene.Net.Search.IndexSearcher..ctor(Directory path, Boolean readOnly) +74
LuceneSearchwebRole.Search.LuceneSearch._search(String searchQuery, String searchField, AzureDirectory azureDirectory) +205
LuceneSearchwebRole._Default.BtnSearch_Click(Object sender, EventArgs e) +512
System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument) +154
System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +3707
I am looking at creating a Lucene.Net Azure solution that involves a front end that does the searching and uploads files to blob storage, then enqueues a reference to each file for a worker role (the indexing component) to dequeue. I was thinking I would install the Microsoft Office 2010 Filter Pack and Adobe PDF IFilter as startup tasks (assuming they have silent installs) during the deployment of the worker role. To index a file, the role would first locate the appropriate native .dll that contains the IFilter based on the file extension, load it, and then call GetChunk on the IFilter to process the document. Is it possible to use an IFilter without having to write the file locally (i.e., from a stream), since the files will already be in blob storage? After the string is built up, the role would then use AzureDirectory to index the document.
Is there a better way to do any of this?
IFilter is typically implemented with IPersistFile, which loads the given input file using a UNC path as a parameter, whereas from blob storage we only get a Stream or an HTTP URI to access the file.
A few IFilters do implement IPersistStream; with those you can load the blob stream into a global memory stream and pass the IStream to load the file and parse it.
// Copy the blob stream into unmanaged memory.
byte[] streamArray = ReadToEnd(blobstream);
IntPtr nativePtr = Marshal.AllocHGlobal(streamArray.Length);
Marshal.Copy(streamArray, 0, nativePtr, streamArray.Length);
// Create a COM stream over the unmanaged memory (fDeleteOnRelease = true).
IStream comStream;
NativeMethods.CreateStreamOnHGlobal(nativePtr, true, out comStream);
// Load the contents into the IFilter using the IPersistStream interface.
var persistStream = filter as IPersistStream;
if (persistStream != null)
{
    persistStream.Load(comStream);
    if (filter.Init(iflags, 0, IntPtr.Zero, out flags) == IFilterReturnCode.S_OK)
    {
        // call GetChunk/GetText here to extract the document text
    }
}
Hi, when AzureDirectory tries to OpenInput for the first time, say when indexing the first document, it looks for the segments.gen file in blob storage as well as in the cache directory. It is not found in either, and I receive a FileNotFound exception.
But the indexing process is successful, and I am able to search for that particular document.
How can I handle this?
Hi, is it possible to point the Lucene Azure library to an Azure virtual drive in a worker or web role?
We have two issues with the local cache drive:
It is slow to initialize the local cache for the first call; if I understand right, it needs to download all the index data from blob storage.
When the index is huge (>200 GB), we don't have enough space locally.
If we can use the Azure drive, can we reuse the index data after the role is rebooted?
That way we wouldn't have to re-download, and we would get a bigger drive size.
I believe we found several bugs in the AzureIndexInput class whereby files from blob storage are always downloaded even if newer files exist in the cache. The changed lines are:
public AzureIndexInput(AzureDirectory azuredirectory, CloudBlob blob)

long blobLength;
if (!long.TryParse(blob.Metadata["CachedLength"], out blobLength))
{
    // Fall back to the actual blob length.
    // The existing code sets this to 0 if "CachedLength" doesn't exist.
    blobLength = blob.Properties.Length;
}

// Serious bug here. CacheDirectory (FSDirectory).FileModified returns the
// UTC file date as the number of milliseconds since 1970-01-01 00:00:00,
// which is not the DateTime number of ticks.
DateTime cachedLastModifiedUTC = new DateTime(1970, 1, 1)
    .AddMilliseconds(CacheDirectory.FileModified(fileName));
Sorry in advance for the formatting. These issues were discovered when upgrading from AzureDirectory 1.0 and Lucene.NET 2.3. Our existing index files are over 600 MB and are always being downloaded.
Can you create a new version 1.0.5.x for users who are still using Lucene.NET 2.x and StorageClient 1.x?
Thanks in advance for your help.
Further to the above, I see that the suggested file-time handling will not work when the metadata has been written by AzureIndexOutput, because AzureIndexOutput uses the value provided by the FSDirectory.FileModified method. The following should work correctly whether or not the metadata exists:
// Standardize the handling to the FSDirectory way.
long cachedLastModifiedUTC = CacheDirectory.FileModified(fileName);

long blobLastModifiedUTC;
if (!long.TryParse(blob.Metadata["CachedLastModified"], out blobLastModifiedUTC))
{
    // The FSDirectory way: milliseconds since 1970-01-01 00:00:00 UTC.
    blobLastModifiedUTC = (long)blob.Properties.LastModifiedUtc.Subtract(
        new DateTime(1970, 1, 1, 0, 0, 0)).TotalMilliseconds;
}

TimeSpan blobDiff = TimeSpan.FromMilliseconds(
    blobLastModifiedUTC - cachedLastModifiedUTC);
Also, the AzureIndexOutput constructor uses "_name" for the Mutex before "_name" is initialized. It should do this the same way as the AzureIndexInput constructor.
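For clarity, the init-order fix described above would look roughly like this; the field names follow the thread, but treat the exact constructor body as an assumption about the AzureDirectory sources:

```csharp
// Hypothetical corrected constructor: _name must be assigned from the blob
// before it is used as the Mutex name, mirroring AzureIndexInput.
public AzureIndexOutput(AzureDirectory azureDirectory, CloudBlob blob)
{
    _azureDirectory = azureDirectory;
    _blob = blob;
    _name = blob.Uri.Segments[blob.Uri.Segments.Length - 1]; // initialize first...
    _fileMutex = new Mutex(false, _name);                    // ...then use it here
}
```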
"When Lucene asks for a read stream for a segment (remember segments never change after being closed), AzureDirectory looks to see if it is in the local cache Directory, and if it is, simply returns the local cache stream for that segment. Otherwise it fetches the segment from blob storage, stores it in the local cache Directory, and then returns the local cache stream for that segment."
So my issue is that every time I rebuild and run the project locally, it needs to fetch all the segments needed for reading, which is time-consuming when trying to develop the application. What is a workaround for this issue?
If you mean you are running in the Azure emulator, then just override the cache location with a fixed temp folder, something like:
azureDirectory = new AzureDirectory(storageAccount, catalog, new SimpleFSDirectory(new DirectoryInfo(tempFolder)));
instead of letting it default:
azureDirectory = new AzureDirectory(storageAccount, catalog);
By default, the temp folder used is the approved local-storage temp folder for the Azure emulator, and every "deployment" in the Azure emulator creates a new local folder, which is why you have to download everything each time.
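A sketch of pinning the cache to a stable folder during local development, assuming the three-argument AzureDirectory constructor that takes a Lucene cache Directory; the container name and folder path are placeholders:

```csharp
// Reuse one cache folder across emulator runs so already-downloaded
// segments are not fetched from blob storage again.
string cachePath = Path.Combine(Path.GetTempPath(), "lucene-cache");
System.IO.Directory.CreateDirectory(cachePath);

var azureDirectory = new AzureDirectory(
    cloudStorageAccount,   // your CloudStorageAccount
    "indexes",             // placeholder container name
    new Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(cachePath)));
```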
I built an index using a WinForms app on my Windows 7 Home edition machine. When I try to use an Azure service to search from my local box, it works fine, but when it is deployed to Azure I am getting:
[CorruptIndexException: doc counts differ for segment _1: fieldsReader shows 220 but segmentInfo shows 540]
Lucene.Net.Index.DirectoryIndexReader.Open(Directory directory, Boolean closeDirectory, IndexDeletionPolicy deletionPolicy) +68
Maarg.EDIActive.WebRole.Controllers.SearchController.Index(String id, String
Any idea what I did wrong?
Hi. I've got a similar problem to what others are describing, but the responses just say that EOF exceptions are part of normal operation, yet that's not what we're seeing...
We have tried to integrate Lucene.NET and AzureDirectory into our C# project on Azure. The AzureDirectory and IndexSearcher are singletons, and on "Azure Staging" it works great. But when we publish it to "Azure Production", we get a »read past EOF« exception. The AzureDirectory cache only contains 3 files out of 50, so all subsequent searches fail.
private static Lucene.Net.Util.Version LuceneVersion = Lucene.Net.Util.Version.LUCENE_30;
private static Microsoft.WindowsAzure.Storage.CloudStorageAccount cloudAccount = null;
private static AzureDirectory azureDirectory = null;
private static Lucene.Net.Search.IndexSearcher searcher = null;
private static StandardAnalyzer analyzer = null;
string filePath = RoleEnvironment.GetLocalResource("LuceneCache").RootPath;
azureDirectory = new AzureDirectory(cloudAccount, "mydbitems", new Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(filePath)));
searcher = new IndexSearcher(azureDirectory, true);
analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(LuceneVersion);
Thanks for any ideas.