Hi, Thanks for your great work, it made Lucene indexing in Azure much easier. We have a worker role to do the indexing and create or update the catalog, and a web role that performs the search. The web role periodically checks for any catalog updates, and refreshes the IndexSearcher in order to get the latest updates. We are using an FSDirectory as the cache in both roles. Our problem is that this will download the new segments, but not remove any deleted segments, so the cache directory in the web role keeps increasing in size with every index refresh. The option to use a RAMDirectory in the web role might not be so good, since our catalog is more than 1GB in size.
Yes, this is a bit of a problem, but in reality I would be extremely surprised if the size of your index exceeded the amount of local disk space available. A small VM has 250GB, a medium VM 500GB, a large VM 1TB and an extra large VM 2TB. That said, you can delete any file in the cache at any time, and if it is still needed it will simply be downloaded again. A simple way of doing this would be to use the file's last access time (System.IO.FileInfo.LastAccessTimeUtc) to decide what to delete. You could just periodically delete any file for which (DateTime.UtcNow - file.LastAccessTimeUtc) > TimeSpan.FromDays(5) (or whatever timespan you want). If you make a mistake it will download the missing file again, so the only impact is network churn.
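The cleanup described above could be sketched roughly as follows. This is only a minimal sketch, not part of AzureDirectory: the class name CacheCleaner and the method DeleteStaleFiles are made up for illustration, and it assumes the cache is a flat folder of files (as an FSDirectory cache is). It also assumes last-access timestamps are actually being updated on the volume; on some Windows configurations they are disabled, in which case the last write time may be a better criterion.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of AzureDirectory): removes cache files
// that have not been accessed for longer than maxAge.
public static class CacheCleaner
{
    // Returns the number of files that were deleted.
    public static int DeleteStaleFiles(string cachePath, TimeSpan maxAge)
    {
        int deleted = 0;
        DateTime cutoff = DateTime.UtcNow - maxAge;

        foreach (string file in Directory.GetFiles(cachePath))
        {
            // Keep anything that was accessed after the cutoff.
            if (File.GetLastAccessTimeUtc(file) >= cutoff)
            {
                continue;
            }

            try
            {
                File.Delete(file);
                deleted++;
            }
            catch (IOException)
            {
                // The file is probably still open by a searcher; skip it
                // and let the next cleanup pass try again.
            }
            catch (UnauthorizedAccessException)
            {
                // Read-only or locked file; skip it as well.
            }
        }

        return deleted;
    }
}
```

It could be called from a timer in the web role, for example CacheCleaner.DeleteStaleFiles(cachePath, TimeSpan.FromDays(5)). If a file that is still needed gets deleted, it should simply be downloaded again the next time it is requested.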
Are there any good practices for using this with NHibernate Search? My solution seems a little brittle at the moment. The main problem is a lock file getting stuck (in the blob storage), so I have to go and delete it manually. I've overridden QueuedFullTextIndexEventListener to instead add a message to a queue, which a single worker role picks up and indexes. I guess I'm after some ideas on how to handle errors in the indexer role and what to do if there is a problem with the index. My question is a bit vague, but some ideas would be appreciated.
At the time I wrote this, Azure didn't have any expiring lock. It now has Lease Blobs, which I believe encapsulate the semantics of what the lock blob is doing. Just change the lock implementation to use a Lease Blob; if the lock gets orphaned, Azure will expire the lease and let someone else use it. If I get a chance I will change AzureDirectory to use that... just a bit busy right now.
I just uploaded v1.0.5 of the source, which uses Azure Lease Blobs for the locking mechanism. This means that lock files will lose their lease after 60 seconds if they aren't renewed, which prevents orphaned lock files.
I've been running this update (v1.0.5) for a few weeks now and haven't had any issues. It did worry me a bit at first because the .lock files stay around but I guess it's meant to work like that as it's been working great.
Yes, the change was to create a lock file and then use the blob Lease command on that lock file. So the lock file is not created and deleted each time; instead a lease on that file is acquired, which expires if it is not kept alive.
Hi Tom. Firstly, many thanks for this project - it's saved me a lot of development. I came across an issue when testing between DevStorage and the Azure Storage Service (either Staging or Production). When you instantiate AzureDirectory, the blob filenames from the container are cached in C:/Windows/Temp/AzureDirectory/ContainerName. My segments filename in DevStorage is segments_8, whereas in Production the segments filename is segments_4. When you switch from testing against DevStorage to Azure Production, AzureDirectory reports an error when it's instantiated - 'cannot find file segments_8' (or similar wording). It appears that the instantiation process is looking for the older file first (both segments_8 and segments_4 are listed in C:/Windows/Temp/AzureDirectory/ContainerName). The workaround is straightforward: simply clear out the ...temp/azuredirectory folder locally. Just thought I'd let you know, for anyone else who might encounter the same issue. Regards, Kieran
Keeping the local cache is really by design (so that your cache survives a reboot). If you are connecting to 2 different backend stores, you should really just use 2 different local paths to cache the state of your local dev storage and production storage.
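One way to keep the caches apart is to derive the local cache folder from the name of the backend store. The sketch below only builds the path. The class name CachePaths, the root folder name "AzureDirectoryCache" and the store labels are assumptions chosen for illustration; how the resulting path is handed to AzureDirectory (for example through an FSDirectory used as the cache) depends on the version you are using and is not shown.

```csharp
using System;
using System.IO;

// Hypothetical helper: gives each backend store its own local cache folder
// so that dev storage and production never share cached segment files.
public static class CachePaths
{
    // storeName is any label you choose per backend, e.g. "devstorage"
    // or "production"; catalog is the container/catalog name.
    public static string For(string storeName, string catalog)
    {
        string root = Path.Combine(Path.GetTempPath(), "AzureDirectoryCache");
        string path = Path.Combine(Path.Combine(root, storeName), catalog);

        // CreateDirectory does nothing if the folder already exists.
        Directory.CreateDirectory(path);
        return path;
    }
}
```

With this, CachePaths.For("devstorage", "catalog") and CachePaths.For("production", "catalog") end up in different folders, so a stale segments_N file from one store cannot be picked up when connecting to the other.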
If I install the current release (1.0.5) on a worker role, it causes an error about StorageClient 1.1 not being present. If I use AzureDirectory in a web role it works, even though my app uses StorageClient 1.7. If I build AzureDirectory from source, update it to .NET 4 and use StorageClient 1.7, the issue goes away. However, I then have to implement Dispose(). Should this code work with projects using StorageClient 1.7?
Can this codebase be open sourced so that everyone who is using it can continue to update it and keep it current? Regards,
Hi,
Great job on all of this!
I'm trying to see how I could have multiple directories under one Azure blob container using AzureDirectory. Currently AzureDirectory seems to copy all Lucene files directly into the container (the 'catalog' constructor parameter). So if I had 2 different search directories under one container, the files would presumably conflict? What would be the best way to resolve this? (I would prefer to keep one Azure blob container for all my Lucene files.)
Thanks,
Stevo
Hmm... currently each catalog is a separate container to prevent any sort of conflicts, but it should be reasonably easy to change the code to be directory-scope based; enumeration would have to be within the container/subpath, etc. But why not just use containers?
Is it correct that once an IndexReader/IndexSearcher is reopened, only the incremental changes will be downloaded from the blob store and not the full index? Also, for an already opened IndexReader, will there be some indication if the underlying index in blob storage changes? For example, will the IsCurrent() method return false if the index changes?
Creating a new IndexSearcher will get just the incremental changes. (This is because a new segments file will be downloaded, which will then reference the new segments that make up the incremental changes, and those will be downloaded in turn.) I don't know if IsCurrent() will reflect the remote state or not.
We do all index update operations from a worker role, sequentially, and then we consume the index from a web role. From time to time, for some reason, we cannot create a Searcher object; the error mentions "segment_x". Investigation shows that segment_x is not present in blob storage, and AzureDirectory reports a "File Not Found" 404 error. All subsequent update operations run smoothly without any crashes or errors, but from the first time the error occurs, consuming the index no longer works properly because, as mentioned, a segment cannot be found in blob storage. I would really appreciate any suggestions or help, guys.
I am trying to confirm whether this should work and I am doing something wrong, or whether using it as is with Lucene.NET 2.9.4 requires code changes to AzureDirectory. When I do the following to the AzureDirectory project (version 8/31/2011) I receive several errors.
What I did:
- Download AzureDirectory, unblock the zip, extract.
- Open the solution in VS 2010.
- Change both the AzureDirectory and TestApp properties to target the .NET 4 Framework instead of 3.5.
- Remove the reference in both projects to Lucene.NET that points to the 2.3.1.3 DLL provided in the zip.
- Add a reference to Lucene.NET 2.9.4 in both projects via NuGet.
- Save, Build Solution, get errors.
Errors are:
- 'Lucene.Net.Store.Azure.AzureDirectory' does not implement inherited abstract member 'Lucene.Net.Store.Directory.Dispose()'
- Member 'Lucene.Net.Store.Azure.AzureDirectory.List()' overrides obsolete member 'Lucene.Net.Store.Directory.List()'. Add the Obsolete attribute to 'Lucene.Net.Store.Azure.AzureDirectory.List()'
- Member 'Lucene.Net.Store.Azure.AzureDirectory.RenameFile(string, string)' overrides obsolete member 'Lucene.Net.Store.Directory.RenameFile(string, string)'. Add the Obsolete attribute to 'Lucene.Net.Store.Azure.AzureDirectory.RenameFile(string, string)'
- 'Lucene.Net.Store.Directory.List()' is obsolete: 'For some Directory implementations (FSDirectory, and its subclasses), this method silently filters its results to include only index files. Please use ListAll instead, which does no filtering.'
- 'Lucene.Net.Store.Directory.RenameFile(string, string)' is obsolete
I haven't done the work, but it doesn't sound like big changes. Sounds like you need to implement Dispose(), add an annotation and tweak the other things.
Will it be possible to open source the code so we can contribute such changes? I've uploaded a fixed NuGet version for this issue: http://nuget.org/packages/Lucene.Net.Store.Azure