So my original idea isn't going to work as planned. They have hardened the web server's directories pretty well. I'm going to think about this now for a
SEO, for reference, stands for Search Engine Optimization. The module I have will allow me to spawn a "mini" crawler targeted at a directory or
subdirectories on any domain and attempt to download and store a local copy.
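For anyone following along, here is a rough sketch of the "stay inside one directory" part of a crawler like that. It is not the module mentioned above, just an illustration using the Python standard library; the names LinkCollector and links_in_scope and the example.com URLs are made up. It only decides which links on a fetched page fall under the target directory; the downloading is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def links_in_scope(html, page_url, root_url):
    """Return absolute links found in html that sit under root_url's directory."""
    root = urlparse(root_url)
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for link in collector.links:
        absolute = urljoin(page_url, link)  # resolve relative links
        parts = urlparse(absolute)
        same_host = parts.netloc == root.netloc
        under_root = parts.path.startswith(root.path)
        if same_host and under_root and absolute not in found:
            found.append(absolute)
    return found
```

Note that this only works when some page actually links to the files, which is exactly what fails when a directory holds raw, unlinked images.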
Directory browsing is disabled, and since there are just raw images placed in the directory, there is no code cross-linking any pages, thus it can't
I had posted my response before I even looked into it. This is a common method that we use at work to extract large document caches from
various US agencies, so I thought the solution might work.
Sidenote: When I did the POC on this script I pointed it at the Mission Enterprise site. I was amazed at the amount of image content I received that
wasn't posted in any of the site content. What's funny is that I forgot I had a roaming profile and that the default directory for saving was part of
the profile. It was pretty funny the next day when the backup admin was so confused in the morning as to why the backup data sets grew so much. I forget,
but crawling all related content from that site came to something like 100+ GB, mainly due to high-resolution image dumps. I can only imagine their usage
report for that day; I let it run overnight on 100 Mbps fiber.
You mentioned you have blocks recorded, which I might be able to do something with in another manner. Could you do me a favor and post what a block of
10-20 images would look like, as in all the URIs in a list? Chances are I can just regex the pattern in their naming convention and let
something loose on the site. It will be a stab in the dark, might not work, and could take a long time, but I am game just because it's fun to do this
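To give an idea of what regexing the naming convention could look like, here is a minimal sketch. It assumes the file names contain a zero-padded counter, which is only a guess until a real block of URIs is posted; the function name candidate_uris, the span parameter, and the example.com URL in the usage line are all hypothetical. It takes known URIs, finds the last run of digits in each, and builds nearby guesses that could then be requested one by one.

```python
import re


def candidate_uris(known_uris, span=5):
    """Guess neighbouring URIs by varying the last run of digits in each one."""
    # prefix (lazy), last run of digits, trailing non-digits such as ".jpg"
    pattern = re.compile(r"^(.*?)(\d+)(\D*)$")
    known = set(known_uris)
    guesses = []
    for uri in known_uris:
        match = pattern.match(uri)
        if not match:
            continue  # no digits at all, nothing to vary
        prefix, digits, suffix = match.groups()
        number = int(digits)
        for n in range(max(0, number - span), number + span + 1):
            # keep the same zero padding as the original name
            guess = f"{prefix}{str(n).zfill(len(digits))}{suffix}"
            if guess not in known and guess not in guesses:
                guesses.append(guess)
    return guesses


# Hypothetical usage: one known image, look one step either side of it.
print(candidate_uris(["http://example.com/img/IMG_0012.jpg"], span=1))
```

One caveat: an extension that itself contains a digit (say .mp3) would be picked up as the "last run of digits", so the pattern would need tightening for names like that.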
I'm sure other ideas will come to me, but I am Le Tired from working all day. I am off work for a month on StayCation as of today,
so I have limited resources as far as bots and enterprise-level crawlers go.
Keep the thread updated if you or anyone else makes progress so I don't waste my time. I don't post often but wanted in on this one, since I just so
happen to be in the business of data harvesting from public government sources / agencies. Legally, that is, as it's a public corp.
Best of luck!
Edit: My SEO trick does work on this URL so I am going to let it run for a few hours and see what kind of data I can mine out of it.
Can you provide me a link to where I can start on page one and read the content? You mentioned they are in the site's search results, so one of those
links should work.
edit on 4-8-2011 by Thereal because: (no reason given)