We can't resolve a "read past EOF" exception. Everything works perfectly in the emulator. This is how we use AzureDirectory in our Azure WCF service:
    public class IndexSearchWebService : IIndexSearchWebService
    {
        private static AzureDirectory m_iUnitAzureDirectory = new AzureDirectory(
            storageAccount, "catalog_name", FSDirectory.GetDirectory(@"location_of_index_local"));
        public List<ResultType> Search(string keyWord)
        {
            List<ResultType> list = new List<ResultType>();
            IndexSearcher searcher = new IndexSearcher(m_iUnitAzureDirectory);
            MyQuery query = new MyQuery();
            query.WithKeywords(keyWord);
            Hits hits = searcher.Search(query.Query);
            MyResultDefinition resultDef = new MyResultDefinition();
            ...
            return list;
        }
    }
The exception is thrown at "new IndexSearcher(m_iUnitAzureDirectory)". The index files are copied from storage to the server normally. Thanks a lot for your help, Alen
Hmm, I don't know what you are hitting. Perhaps you have a corrupt index?
Hello! I have a strange issue in my web application. I have a web site that uses an Azure queue to hold the objects that need to be updated, and a Worker Role that updates the index. That part works as expected: after the role updates the index I can see the new, updated files in the blob. But when I read the index via AzureDirectory it gives me the old data, so the index contains the updated items while the web site UI still shows the old ones. I am pretty sure this is a caching issue (everything works well in the Azure dev emulator). Can you please give me a clue about where to look? Thank you!
When the IndexSearcher grabs an instance of the IndexReader it gets a snapshot of the state of the index. If you don't recreate the IndexSearcher you never see the index change. Just periodically refresh the IndexSearcher and I think it will solve your problem.
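One way to "periodically refresh" is to hide the searcher behind a small holder that rebuilds it once it gets too old. The sketch below is only an illustration under assumptions: `Refreshing<T>` is a hypothetical helper, not part of Lucene.NET or AzureDirectory, and the factory delegate is where something like `() => new IndexSearcher(azureDirectory)` would go.

```csharp
using System;

// Hypothetical helper (not part of Lucene.NET or AzureDirectory): hands out a
// cached instance and rebuilds it once it is older than maxAge.
public sealed class Refreshing<T> where T : class
{
    private readonly Func<T> _factory;      // assumed: e.g. () => new IndexSearcher(azureDirectory)
    private readonly TimeSpan _maxAge;      // how stale a snapshot may get
    private readonly Func<DateTime> _clock; // injectable so the behaviour can be tested
    private readonly object _gate = new object();
    private T _current;
    private DateTime _createdAt;

    public Refreshing(Func<T> factory, TimeSpan maxAge, Func<DateTime> clock = null)
    {
        _factory = factory ?? throw new ArgumentNullException(nameof(factory));
        _maxAge = maxAge;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    public T Get()
    {
        lock (_gate)
        {
            DateTime now = _clock();
            if (_current == null || now - _createdAt >= _maxAge)
            {
                _current = _factory(); // new snapshot of the index as of now
                _createdAt = now;
            }
            return _current;
        }
    }
}
```

A request handler would call `holder.Get()` instead of constructing a searcher itself. The sketch deliberately does not close the previous instance, because other threads may still be searching with it; how to retire old searchers safely is left open here.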
We have tried to integrate Lucene.NET and AzureDirectory into our ASP MVC project on Azure. The search is working OK, but we've had a lot of problems with generating and maintaining our index:
- If we instantiate AzureDirectory and IndexSearcher per request, everything is very slow.
- When we tried to keep AzureDirectory and IndexSearcher as a singleton, we got a "read past EOF" exception.
Since we couldn't get AzureDirectory to work, we have put the index files into the "siteroot" directory of our web page without using AzureDirectory (the index is generated directly on the WebRole). The main problem with this solution is that every few hours or days (it's random!) Azure seems to delete our index files on the WebRole. As a result, the index needs to be regenerated, and some of our site's functionality doesn't work while the index is being generated. Do you have a solution for the "EOF exception"? Do you have any advice on the architecture, or some best practices on how to integrate Lucene.NET (and AzureDirectory) with ASP MVC projects on Azure? Thanks a lot for your help, Alen
AzureDirectory should be a singleton. You do not want to recreate it on every request. Make sure you are using Azure local storage and an FSDirectory as your local cache. Lucene.NET throws EOF exceptions as part of its normal execution. Were you seeing unhandled exceptions? If not, then it is a normal part of its operation. Normally you create a new IndexSearcher every time you want a fresh view of the data. Creating a new IndexSearcher is relatively cheap but not free, so depending on your data you may decide to do it every request, once a minute, once an hour, or whatever.
Hi, How do we rank data with Lucene.Net? Can I influence the results by overwriting the default algo? Thanks in advance. Mifla
There are copious amounts of information about using Lucene for ranking, including an excellent book which you can get on Amazon.
Hi, I have a web role and a WCF service which uses a RAMDirectory as cache. My problem is that when the WCF service is not hit by a request (say for 1 hour), IIS just kills the service, the instance of the AzureDirectory is gone, and I have to instantiate it again. It takes 1.5 minutes to get everything from the blob into the RAMDirectory again. How can I solve this? I have tried to serialize the RAMDirectory, deserialize it, and use that instance when creating the AzureDirectory, but it does not use my serialized cache. Thank you very much!
You can triple cache it like this:
    AzureDirectory azureDirectory = new AzureDirectory("MyIndex", new RAMDirectory(new FSDirectory(@"c:\myindex")));
This essentially uses RAMDirectory as your cache over the local file system cache of the remote data. This should save you from having to fetch the segments on every start. That said, I don't know how to prevent IIS from killing the service...that's outside the scope of this forum.
Hi again, I use version 2.9.4 and the constructor for FSDirectory is internal, what version are you using?
I am looking at creating a Lucene.NET Azure solution that involves a front end that does the searching and uploads files to blob storage, and then enqueues a reference to each file for a worker role (the indexing component) to dequeue. I was thinking I would install the Microsoft Office 2010 Filter Pack and the Adobe PDF IFilter as startup tasks (assuming they have silent installs) during the deployment of the worker role. To index a file, the role would first locate the appropriate native .dll that contains the IFilter based on the file extension and then load it. From there it would call GetChunk on the IFilter to process the document. Is it possible to use an IFilter without having to write the file locally (i.e. from a stream), as the files will already be in blob storage? After the string is built up, the role would then use AzureDirectory to index the document. Is there a better way to do any of this?
Hi, Just as "k.c.s." brought up, I have tried manually upgrading to Lucene.NET 2.9.4 and it was surprisingly easy and worked well. I am now running some tests and all works well so far. On the other hand, I have hit a serious issue (for me) that I am sure is purely Lucene.NET related and a consequence of the upgrade. I have a specific application where my readers refresh the index fairly often, so it is very important for me that the Lucene index occupies as few files as possible. With Lucene.NET 2.9.2 I managed to bring the index file count down to 3 or 4, while with the upgraded Lucene.NET 2.9.4 the entire index spans 10 files. Believe it or not, this is a severe performance hit, as requesting 10 files from storage versus 3 makes a few seconds' difference despite the fact that the total file size is identical. These tests are from a local debug environment. I have yet to deploy to staging to test there. I am using the following to minimize the file count in Lucene.NET 2.9.2:
    writer.SetUseCompoundFile(true);
    writer.Optimize(1, true);
So, these give 3 files in the older versus 10 in the new Lucene.NET. Is there another way I can control this any better? Ideally, I'd prefer to have the entire index in one solid file, but that is probably impossible. Any ideas? Thanks!
I think you are doing exactly the opposite of what you need to be doing. Refreshing the readers often is no problem, but why do you think that having the lucene index be as few files as possible is important or even desirable? You will end up causing the entire index to be replicated and short circuit the whole incremental nature of indexing. If you want one file you should just drop using AzureDirectory and copy it around yourself but I seriously doubt it will be as fast as turning off compound file and optimizing rarely. Read my post on the serious perf problems that compound files and optimizations cause with AzureDirectory and distributed nodes, and then read how Lucene incremental segment index files work.
Thanks for your answer and once again, thanks for your great work. As for my striving towards compound file, this comes purely from first hand experience while debugging my test application. Consider an index of e.g. 5MB. This is not too large. My index will never grow insanely large. Perhaps 20-30MB max. Anyway, my tests are plain and simple. I refresh readers quite often and they need several seconds (3-5) just to pull the updated index (when not in compound mode - approx. 10 files). On the other hand, when I make the index compound (consisting of only 3 files), I get almost 10x the performance increase in terms of pulling the updated index in the web role even with the same amount of bytes. So if you can tell me how to refresh the readers without having 3-5 seconds delay on each refresh, I would love to ditch the compound approach. With the latter I am getting < 500ms refresh of each reader simply because there are fewer files. Not to mention this all happens on a local fairly quick PC with all Azure services running in the same box. I have not run any tests on the Azure itself, but I am afraid to even do so with multi-second reader latency. So, to wrap up. I *know* I am doing the opposite of what you recommended (yes, I have read that part) but what I do now is giving me the best reader performance even if the writer is suffering a bit because of its compound index. I think I can live with writer delays if readers are super fast. Perhaps my debug environment is giving me wrong figures so my assumptions are all wrong? I'd appreciate any further comments you may have. Thanks!
I recommend setting these #defines: FULLDEBUG and COMPRESBLOBS. This will output to the debug window every time a segment is opened from the cache, downloaded from storage, etc. I would turn off compound files and optimizations and see what happens. What you should see is that only the modified segment is downloaded when it changes. If you see all segments being downloaded every time, then you are doing something wrong. You don't need to create a new IndexReader() every time. All you should do is create a new IndexSearcher() over the existing IndexReader(AzureDirectory()) instance. (In older versions of Lucene this was not the case, but I think from 2.x onward you can just keep using the single IndexReader.) Let me know what you find out, but if you turn on the debug statements you will have a clear picture of how your code is interacting with the remote blob storage. You also should not be recreating the AzureDirectory or RAMDirectory/FSDirectory() instances either. It should be something like:
    AzureDirectory azureDirectory = new AzureDirectory("MyIndex", new RAMDirectory(new FSDirectory(@"c:\myindex")));
    // IndexReader is abstract, so it is opened through its static factory (read-only here)
    IndexReader reader = IndexReader.Open(azureDirectory, true);
    while (true)
    {
        IndexSearcher searcher = new IndexSearcher(reader);
        // use it for X time
        // ...
    }
Looking at one of our projects, the way we use it is to just grab the reader from inside the IndexWriter like this:
    private void _refreshIndexer()
    {
        _searcher = new IndexSearcher(_indexWriter.GetReader());
    }
Hi, is there any way to use Luke to view the index? I've downloaded the index to my local computer, but when I point Luke to it, it shows the message "No valid directory at the location, try another location." Thanks
I've just tried opening the indexes with Luke in the emulator's local store and some of them worked, so I think I've broken something somewhere.
Ah, the blobs in blob storage are normally compressed. If you point Luke to the local cache then you can definitely inspect them. Or you can turn off the compression.
Thanks, I can read all the indexes on my local store by turning off compression. Also, the file name underscores got removed when I downloaded the files from blob storage with Azure Storage Explorer. Luke could read them after renaming them.
Hi, I have been playing around with AzureDirectory a bit and it is quite cool and really fast. I am having one issue though when creating a new directory. When I call the following:
    IndexWriter writer = new IndexWriter(_directory, GetAnalyzer(), !indexExists);
for a directory that does not yet exist, I always get a FileNotFoundException on segments.gen (AzureDirectory.cs line 260). Has anyone else experienced this, or is there something I am missing when creating a new directory? Cheers.
Lucene.NET uses file not found exceptions as part of its normal path, so this is normal.
Thanks for clarifying this. I was getting the file not found exception as well on the first run (you can duplicate this by deleting the files from local cache directory and then running with a new catalog name).
Hi, I am getting the same exception over and over while trying to run the sample TestApp. I have modified app.config to use my storage account. The aforementioned line keeps throwing file not found on "segments.gen" for me as well. Any idea how to get past this error? Following is the stack trace:
    at Lucene.Net.Store.Azure.AzureDirectory.OpenInput(String name) in C:\Users\shiveshr\Desktop\Azure Library for Lucene.Net (Full Text Indexing for Azure)\C#\AzureDirectory\AzureDirectory.cs:line 274
    at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()
Could someone please help me get past this?
Also, when I create a blob myself (tried with both a null string and a junk string), I start getting an exception thrown at a different location: {"read past EOF"}
    at Lucene.Net.Store.BufferedIndexInput.Refill()
    at Lucene.Net.Store.BufferedIndexInput.ReadByte()
    at Lucene.Net.Store.Azure.AzureIndexInput.ReadByte() in C:\Users\shiveshr\Desktop\Azure Library for Lucene.Net (Full Text Indexing for Azure)\C#\AzureDirectory\AzureIndexInput.cs:line 194
    at Lucene.Net.Store.IndexInput.ReadInt()
    at Lucene.Net.Index.SegmentInfos.FindSegmentsFile.Run()
As mentioned before, Lucene.Net uses file not found and read past EOF exceptions as part of normal operations. Change your debugger settings so that it does not stop on those exceptions.
I'm a newbie with Lucene and Azure, and trying to figure out how to use AzureDirectory. If I have multiple instances of my MVC app on Azure, each with its own IndexWriter (without using any queue solutions):
1. Does AzureDirectory automatically make the IndexWriter not update(A) the index in Azure blob storage if another instance is writing to it at the moment?
2. If 1 is true, what should I do to make sure that update(A) is eventually written to Azure blob storage?
3. If 1 is false, how do I use AzureDirectory to implement something described here, http://code.msdn.microsoft.com/Azure-Library-for-83562538/sourcecode?fileId=18714&pathId=1390456562 in the "Resolving the concurrent issue" section?
Thanks for your input
AzureDirectory implements a Lock() method which "locks" the blob storage for write. IndexWriter uses that lock method to get exclusive access. So yes, if you create an IndexWriter on one machine, the other machines are prevented from writing to the index. When you close the IndexWriter, the data is flushed to AzureDirectory (and hence to blob storage) and then the IndexWriter releases the lock. Long story short, multiple IndexWriters on the same catalog don't work so well. A natural solution is to have a queue with one node which is responsible for updating the index, and then N nodes with IndexSearchers/IndexReaders consuming the index. Another solution, which is a bit more involved, is to have multiple writer nodes with node affinity for the incoming data (aka a sharding solution). This usually gets complicated pretty quickly. My experience is that Lucene can update the index at around 10-30 docs a second depending on the complexity of your schema, so having one node doing the writing is good enough for most solutions.
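The "queue with one writer node" shape can be sketched as follows. This is an in-process illustration only, under stated assumptions: `SingleWriterQueue<T>` is a hypothetical class, the `BlockingCollection` stands in for whatever queue connects the nodes (in the scenario above that would be an Azure queue), and `applyBatch` stands in for "open the IndexWriter, add the documents, close it".

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch: many producers enqueue work, exactly one consumer applies it,
// so only one writer ever needs the index write lock.
public sealed class SingleWriterQueue<T> : IDisposable
{
    private readonly BlockingCollection<T> _pending = new BlockingCollection<T>();
    private readonly Task _writerLoop;

    // applyBatch is an assumption: it represents one writer session over the index.
    public SingleWriterQueue(Action<IReadOnlyList<T>> applyBatch, int maxBatchSize = 100)
    {
        _writerLoop = Task.Run(() =>
        {
            var batch = new List<T>(maxBatchSize);
            // Blocks until items arrive; the loop ends once CompleteAdding has been
            // called and the queue is drained.
            foreach (T item in _pending.GetConsumingEnumerable())
            {
                batch.Add(item);
                // Flush when the batch is full or the queue is momentarily empty.
                if (batch.Count >= maxBatchSize || _pending.Count == 0)
                {
                    applyBatch(batch.ToArray());
                    batch.Clear();
                }
            }
        });
    }

    // Safe to call from any number of producer threads.
    public void Enqueue(T item) => _pending.Add(item);

    // Stops accepting work and waits for the writer to drain the queue.
    public void Dispose()
    {
        _pending.CompleteAdding();
        _writerLoop.Wait();
        _pending.Dispose();
    }
}
```

Batching is a design choice rather than a requirement: since closing the writer is what pushes data to blob storage, grouping several documents per writer session keeps the number of flushes down.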
I realized I put the wrong link in question 3. It should be http://blogs.msdn.com/b/windows-azure-support/archive/2010/11/03/a-common-scenario-of-multi_2d00_instances-in-windows-azure-.aspx It talks about implementing a wait-and-merge approach:
    private void checkForMerge()
    {
        if (_count > 10)
        {
            // check whether the index is locked
            try
            {
                if (locked…) return; // locked, so return and wait for next turn…
            }
            catch (Exception) { }
            // do merging here…
            _count = 0; // zero the counter
        }
    }
I can use AzureLock on the AzureDirectory to obtain a lock from one instance. If successful, I then close the writer to flush the data to blob storage. If it's already locked, I just do a flush to write the data to the local cache, and wait until the next turn to flush to blob storage. Can you see anything wrong with this implementation? Thanks
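For discussion, the control flow described in this post can be written down independently of Lucene. Everything named here is hypothetical: `DeferredPublisher` is an illustrative class, and the four delegates stand in for "try to take the index lock", "release it", "flush to the local cache", and "close the writer so the data reaches blob storage". It only shows the wait-and-retry logic, not how those operations behave with AzureDirectory.

```csharp
using System;

// Hypothetical sketch of the wait-and-merge flow: publish when the lock is free,
// otherwise flush locally and try again on a later call.
public sealed class DeferredPublisher
{
    private readonly Func<bool> _tryAcquireLock; // assumed: returns false if another instance holds the lock
    private readonly Action _releaseLock;
    private readonly Action _flushLocal;         // assumed: cheap, local cache only
    private readonly Action _publish;            // assumed: expensive, pushes to shared storage
    private readonly int _threshold;
    private int _count;

    public DeferredPublisher(Func<bool> tryAcquireLock, Action releaseLock,
                             Action flushLocal, Action publish, int threshold = 10)
    {
        _tryAcquireLock = tryAcquireLock;
        _releaseLock = releaseLock;
        _flushLocal = flushLocal;
        _publish = publish;
        _threshold = threshold;
    }

    // Call once per added document. Returns true when a publish happened.
    public bool OnDocumentAdded()
    {
        _count++;
        if (_count <= _threshold) return false;  // not enough pending work yet

        if (!_tryAcquireLock())
        {
            _flushLocal();                       // keep the data safe locally
            return false;                        // counter is kept, so the next call retries
        }
        try
        {
            _publish();
            _count = 0;                          // reset only after a successful publish
            return true;
        }
        finally
        {
            _releaseLock();
        }
    }
}
```

Unlike the snippet from the blog post, this version does not swallow exceptions around the lock check, and it resets the counter only after a publish actually succeeds.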