Is there a known easy way to index docs and PDFs? I believe you can use IFilters to extract the text of other file types, but is there an easy way to get these running on an Azure worker role? The Microsoft ones, for example, seem to ship in an install file, which is not great for dynamically loaded instances. (Our users currently upload docs, and although we can index the filenames and some other descriptors that they add, it would be great to index the contents of the files.)
Sure, you can use IFilter, but to get the filters installed you need to turn on the Windows Search service, which you can do by invoking the following PowerShell:
powershell -command "Set-ExecutionPolicy RemoteSigned -Force"
powershell -command "Import-Module servermanager; Add-WindowsFeature FS-Search-Service; Set-Service wsearch -startuptype manual; Stop-Service wsearch"
This installs the Windows Search service, which has the word breakers and IFilters you need. For PDF support you need to install an IFilter for PDF (Windows doesn't have one natively). After that, you can invoke the IFilter on a stream of the document to get tokens, which you can then stick into Lucene.
NOTE: If all you want is to index a bunch of documents on a single node, you can use Windows Search by dumping files to disk on your Azure VM. See http://msdn.microsoft.com/en-us/library/ee872109(VS.85).aspx
What would be the disadvantages of a periodic Commit() on the IndexWriter? It would make sense to do it, since local instance storage is not persistent across a reboot, so if your WorkerRole/Instance goes down you could lose all RAM-buffered but uncommitted index data...
Yep. If this is a big concern, there are a couple of ways of dealing with it. You could NOT use AzureDirectory directly, but instead simply have a local catalog which is updated and committed as frequently as you want. To persist to blob storage and distribute to the rest of the nodes, you could then just take a Snapshot() of the local catalog. Then you create an AzureDirectory and copy/delete any changed/deleted segments from the local catalog to the AzureDirectory (making sure to take a lock while doing so). This would essentially give you high-performance local indexing with periodic backup to blob storage, which IndexSearchers using AzureDirectory would pick up automagically, just like normal.
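The copy/delete step described above amounts to a diff between two file listings. Here is a minimal Python sketch of that idea, not AzureDirectory code: `plan_sync` and the name-to-length dictionaries are hypothetical stand-ins for a listing of the local snapshot and a listing of the blob container. It assumes that comparing name and length is enough to spot a changed file; a checksum or timestamp would be safer.

```python
def plan_sync(local_files, remote_files):
    """Return (to_upload, to_delete) so the remote listing matches the local one.

    Both arguments map file name -> length in bytes (hypothetical listings).
    """
    # Upload files that are new locally or whose length differs remotely.
    to_upload = sorted(
        name for name, length in local_files.items()
        if remote_files.get(name) != length
    )
    # Delete remote files that no longer exist in the local snapshot.
    to_delete = sorted(name for name in remote_files if name not in local_files)
    return to_upload, to_delete


local = {"segments_3": 120, "_2.cfs": 4096, "_3.cfs": 2048}
remote = {"segments_2": 110, "_1.cfs": 1024, "_2.cfs": 4096}
uploads, deletes = plan_sync(local, remote)
print(uploads)  # ['_3.cfs', 'segments_3']
print(deletes)  # ['_1.cfs', 'segments_2']
```

Doing the uploads before the deletes, while holding the lock, should keep searchers from seeing a listing that points at files which are not there yet.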
How would the IndexWriter work in an Azure worker role with multiple instances (say 2 instances, which Microsoft recommends for the 99.9% SLA)? It works by locking the index, and multiple instances mean that the indexing could be handled by any of the instances.
In theory Lucene gathers up a batch of changes in memory, which then gets persisted to disk in a block (read the Lucene documentation about the configuration parameters that control how much is held in memory before flushing to disk). It is only on the flush that a write lock happens, so again, it totally depends on the data, the update frequency, etc. Calling Commit() flushes in-memory operations to disk, so calling it too often means more locks and more disk/network churn in exchange for knowing your data is committed. This is true for any Lucene solution, but with AzureDirectory it is doubly true, as it causes a write lock and the pushing of segments up to blob storage and out to other readers/writers.
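One way to act on that trade-off is to commit on a threshold rather than after every document. The following Python sketch is illustrative only: `CommitPolicy` and its parameters are hypothetical names and are not part of Lucene or AzureDirectory. It commits once a number of documents have been added or a time limit has passed, whichever comes first.

```python
import time


class CommitPolicy:
    """Decide when to commit: after max_docs additions or max_seconds elapsed."""

    def __init__(self, max_docs=1000, max_seconds=60.0, clock=time.monotonic):
        self.max_docs = max_docs
        self.max_seconds = max_seconds
        self.clock = clock
        self.pending = 0
        self.last_commit = clock()

    def record_add(self):
        """Call after each document is added; returns True when a commit is due."""
        self.pending += 1
        return self.should_commit()

    def should_commit(self):
        # Nothing buffered means nothing to commit, however much time has passed.
        if self.pending == 0:
            return False
        elapsed = self.clock() - self.last_commit
        return self.pending >= self.max_docs or elapsed >= self.max_seconds

    def mark_committed(self):
        """Call after the writer's commit has succeeded."""
        self.pending = 0
        self.last_commit = self.clock()


policy = CommitPolicy(max_docs=3, max_seconds=3600.0)
commits = 0
for _ in range(10):
    if policy.record_add():
        # A real indexer would call the writer's Commit() here.
        policy.mark_committed()
        commits += 1
print(commits)  # 3
```

Raising `max_docs` or `max_seconds` reduces lock and upload churn, at the cost of more uncommitted data being lost if the instance goes down.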
I have been playing with Lucene.Net and now wanted to try out your Azure implementation. Being a use-the-latest freak, I have replaced the Lucene.Net.dll (v2.3.1) distributed with your source with the currently latest version, 2.9.2. I get several build warnings regarding obsolete members. I am wondering if these can be fixed easily or if they can be safely ignored? I have tried mapping the obsolete members to non-obsolete ones, but I failed. Do you recommend running your code against Lucene.Net 2.9.2? I see your changelog mentions 2.9.2 compatibility in v1.0.3, but the strange thing is that your current source code is tagged as v22.214.171.124. Can you please advise? Here is a list of the warnings:
warning CS0672: Member 'Lucene.Net.Store.Azure.AzureDirectory.List()' overrides obsolete member 'Lucene.Net.Store.Directory.List()'. Add the Obsolete attribute to 'Lucene.Net.Store.Azure.AzureDirectory.List()'.
warning CS0672: Member 'Lucene.Net.Store.Azure.AzureDirectory.RenameFile(string, string)' overrides obsolete member 'Lucene.Net.Store.Directory.RenameFile(string, string)'. Add the Obsolete attribute to 'Lucene.Net.Store.Azure.AzureDirectory.RenameFile(string, string)'.
warning CS0618: 'Lucene.Net.Store.Directory.List()' is obsolete: 'For some Directory implementations (FSDirectory}, and its subclasses), this method silently filters its results to include only index files. Please use ListAll instead, which does no filtering. '
warning CS0618: 'Lucene.Net.Store.FSDirectory.GetDirectory(string)' is obsolete: 'Use Open(File)'
warning CS0612: 'Lucene.Net.Store.Directory.RenameFile(string, string)' is obsolete
I would like to know the answer for this as well...
Lucene 2.9.2 should be fine. AFAIK the obsolete methods still work, but in Lucene 3.X they will go away. The warnings are just a heads-up that to move to Lucene 3.X you will have to change that code.
Reposting the question for better readability. There were 2 questions, but the formatting was poor (pardon me if the repost looks even worse).
A few questions:
QUESTION 1: What happens if I have 100GB of indexed data in blob storage but my searcher WorkerRole instance is extra small (20GB of local storage) and I am using the instance/worker file system as my local cache directory? Would this actively manage the local cache directory so that recently used search index segment files (http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/) are cached locally and, as they fill up the 20GB limit, delete the files that haven't been used from the local cache in order to download the index files needed for the current search? I am just trying to understand how I would design this so that I can scale it (the extra large instance gives 2TB of instance storage; what happens to the local cache folder after that?).
QUESTION 2: Would this work with multiple compute instances/WorkerRoles for the index writer (as the scaling needs increase), or are we limited to one instance for writing because of the need for locking? This is assuming that the index writer WorkerRole is pulling data from Azure Queues.
P.S.: Sorry if I asked something really obvious. I am new to both Azure and Lucene, but there is no other option in Azure for full text searching (SQL Azure doesn't have full text search) or for large volumes of data (SQL Azure is limited to 50GB and is extremely expensive for a low-transaction, high-volume data scenario).
Does this support Lucene's MultiReader and MultiSearcher methods for multiple local cache directories (say multiple Azure Drives if the index takes more space than 1TB)?
Yes, it supports the full Lucene stack, as it is just an abstraction of how Lucene talks to the file system. MultiReader/MultiSearcher etc. all work just fine.
It does nothing to support catalogs larger than local storage; you have to deal with that issue yourself.
With regard to multiple roles writing, it totally depends on how frequently you are updating your index. Using file locks as the locking mechanism is definitely a bottleneck. There are a number of approaches you could take to deal with that. You could shard your data into multiple catalogs, which allows multiple writers without contending for one lock, and then use MultiReader/MultiSearcher to have a single view. In theory Lucene gathers up a batch of changes, which then gets persisted to disk in a block (read the Lucene documentation about the configuration parameters that control how much is held in memory before flushing to disk). It is only on the flush that a write lock happens, so again, it totally depends on the data, the update frequency, etc.
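For the sharding approach, each writer needs a stable rule for which catalog a document belongs to, so that later updates and deletes for the same document reach the same catalog. A minimal Python sketch of one such rule follows; `shard_for`, `catalog_name` and the `catalog-N` naming are hypothetical and not part of AzureDirectory. It hashes the document id with SHA-256 rather than Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib


def shard_for(doc_id, shard_count):
    """Map a document id to a shard index in [0, shard_count)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


def catalog_name(doc_id, shard_count, prefix="catalog"):
    """Name of the catalog (one per shard) that should hold this document."""
    return f"{prefix}-{shard_for(doc_id, shard_count)}"


# Group a batch of documents by the catalog that owns them.
docs = ["doc-1", "doc-2", "doc-3", "doc-4", "doc-5"]
by_catalog = {}
for doc_id in docs:
    by_catalog.setdefault(catalog_name(doc_id, 4), []).append(doc_id)
for name in sorted(by_catalog):
    print(name, by_catalog[name])
```

Note that changing the shard count remaps most documents, so with this simple scheme the number of catalogs should be chosen up front.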
Different departments will be using the application, but they aren't allowed to full-text search each other's documents, so each department should get its own Directory, reader and searcher. Will this cause any performance issues when I'm using multiple searchers and readers, each on a different directory? So departmentA will have a searcher and a reader using DirectoryA, departmentB will have a searcher and a reader using DirectoryB, etc.
Should be no problem to have as many multiple catalogs as you want to use.
Any idea why calling IndexWriter.Optimize() would blow up?...
[WaWorkerHost.exe] Handling exception raised whilst writing Lucene index: System.IO.IOException: background merge hit exception: _40:C120361 <...snip...> _3q:C10000 into _42 [optimize] [mergeDocStores] ---> Microsoft.WindowsAzure.StorageClient.StorageClientException: The specified blob already exists. ---> System.Net.WebException: The remote server returned an error: (409) Conflict.
   at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
   at Microsoft.WindowsAzure.StorageClient.EventHelper.ProcessWebResponse(WebRequest req, IAsyncResult asyncResult, EventHandler`1 handler, Object sender)
   --- End of inner exception stack trace ---
   at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.get_Result()
   at Microsoft.WindowsAzure.StorageClient.Tasks.Task`1.ExecuteAndWait()
   at Microsoft.WindowsAzure.StorageClient.CloudBlob.UploadFromStream(Stream source, BlobRequestOptions options)
   at Lucene.Net.Store.Azure.AzureIndexOutput.Close() in C:\Code\AzureDirectory\AzureDirectory\AzureIndexOutput.cs:line 117
   at Lucene.Net.Index.FieldsWriter.Close() in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\FieldsWriter.cs:line 179
   at Lucene.Net.Index.SegmentMerger.MergeFields() in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\SegmentMerger.cs:line 456
   at Lucene.Net.Index.SegmentMerger.Merge(Boolean mergeDocStores) in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\SegmentMerger.cs:line 235
   at Lucene.Net.Index.IndexWriter.MergeMiddle(OneMerge merge) in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\IndexWriter.cs:line 5860
   at Lucene.Net.Index.IndexWriter.Merge(OneMerge merge) in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\IndexWriter.cs:line 5380
   --- End of inner exception stack trace ---
   at Lucene.Net.Index.IndexWriter.Optimize(Int32 maxNumSegments, Boolean doWait) in c:\prg\lucene.net\2.9.2\src\Lucene.Net\Index\IndexWriter.cs:line 3303
<...snip...>
Apologies for the formatting of that stack trace.
It would be good to figure out what blob/file already exists...I haven't seen an error/stack trace like this so I'm not certain what you are hitting.
It seems this was a limitation of the storage emulator, as documented here: http://msdn.microsoft.com/en-us/library/gg433135.aspx
Ah, that makes sense, I've only tested it against live azure storage, not the local storage. I long ago learned not to trust the storage emulator as it just is not a 100% emulation of the live service.
I have one worker role which is adding documents to an index, and a separate web role that's querying it. I'm frequently seeing errors along the lines of this from the web role: System.IO.FileNotFoundException: _2u.tis. Can you point me in the right direction on this? I'm sure it's something obvious, I'm just not having any luck with searches.
I have tracked this down somewhat to an exception that's being swallowed inside AzureDirectory. The underlying exception is an IOException: one of my .tis files is being used by another process. For some reason this is being replaced with a FileNotFoundException.
I'm not certain what you are hitting. One way of diagnosing things is to simply back off to using a local FSDirectory instead of AzureDirectory. If you are still experiencing problems then you know you are probably doing something wrong with regards to the way you are using it.
Have you guys had any luck finding a solution for this one? I'm having the same issue... I have separate instances handling indexing and searching, and have all the coordination between the two working correctly. Everything works perfectly with FSDirectory, but as soon as I use AzureDirectory I start getting FileNotFound exceptions when I update the index. One of the files it has failed on is "_h.cfs", but a few others have been reported as well. The errors are occurring on both the writer and the searcher. Thanks, Tim
Please review the patch at http://pastebin.com/d5HZ2YJ0, which I believe has fixed this problem. It seems that the underlying problem was related to Mutex permissions. *WORKS FOR ME*
I've updated the code to use the Mutex solution that Andy added. Thanks Andy
Hi, I'm working with the testApp. I've set the blob storage account in App.config. When Program.cs starts, it configures CloudStorageAccount.SetConfigurationSettingPublisher with App.config, but when it returns and calls new AzureDirectory, there is no storage account: storageAccount is null! Why? Is this a problem with my Windows Azure Emulator, or do I have an error in App.config?
Sounds like a problem with your App.config.
Hi Thermous, Thanks for making this library available! I am not sure if this has been clarified before, so I kindly request your help with the following. When the Lucene IndexWriter commits, how many blobs are uploaded to blob storage? Just one, or multiple depending on the size of the indexed documents? In relation to this, if the writer is committed before the added documents reach 64MB, then I assume that a single blob is uploaded. However, if the documents added before committing the writer are over 64MB, which of the following is true: 1) a blob is uploaded for every document, or 2) a single blob with many blocks (one per document) is uploaded? The reason I am asking is that I have to estimate the number of transactions my application will make against blob storage. Thank you very much in advance.
Read the book Lucene in Action. Every segment file maps to a blob, but the number of documents in a segment file varies with the merge policy, the number of documents, calls to commit, etc. I can't possibly give an answer without knowing all of that information, and even then I would be wrong. The normal merge policy is X docs per segment with a merge factor of 10: when you get 10 segments of X docs you merge them into 1 segment of 10X, when you get 10 segments of 10X you merge them into a segment of 100X, etc.
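For a rough feel of how that factor-of-10 merging adds up, here is a small Python simulation. It is a deliberately simplified model, not Lucene's actual merge policy (which also looks at segment sizes): `simulate_merges` is a hypothetical helper that assumes every flush writes one segment and that 10 segments at one level are always merged into one segment at the next level.

```python
def simulate_merges(flushes, merge_factor=10):
    """Return (live_segments_per_level, segments_written) after `flushes` flushes.

    Simplified model: every flush writes one level-0 segment, and whenever
    `merge_factor` segments exist at one level they are merged into a single
    segment one level up.
    """
    levels = []   # levels[i] = number of live segments at level i
    written = 0   # every flush and every merge writes one new segment
    for _ in range(flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1
        written += 1
        level = 0
        # Cascade merges upward while a level is full.
        while levels[level] >= merge_factor:
            levels[level] -= merge_factor
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            written += 1
            level += 1
    return levels, written


print(simulate_merges(25))    # ([5, 2], 27)
print(simulate_merges(1000))  # ([0, 0, 0, 1], 1111)
```

Since every segment file maps to a blob, multiplying the segments-written figure by the number of files per segment gives an order-of-magnitude estimate of blob uploads under these assumptions; the real number depends on the factors listed above.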