Help with short batch file to archive a full ATS thread as a PDF file??

posted on Sep, 20 2022 @ 11:43 AM
I'd appreciate some help with creating a short batch file to archive a full ATS thread as a PDF file.

My batch file programming skills are a bit rusty (and were never that great - I am just a lawyer after all...) so working out the exact syntax for this batch file would take me an embarrassingly long amount of time. Some pointers would be very welcome.

As some of you may know, I've been doing some work in recent years to archive a lot of UFO material in PDF format to help preserve it and make it more easily searchable.

I've recently used a very, very short batch file containing a loop to use wkhtmltopdf to convert each html file in a directory to a PDF, at a speed of between 5 and 10 times faster than Adobe Acrobat. The real work is done by the (free) wkhtmltopdf software. This very short and simple batch file is just executed from within a folder of html files to cycle through the files, convert each one to a PDF file and save it in a stated directory:



@echo off
for %%i in (*.htm) do "C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" "%%i" "E:temppdfsfromhtml%%~ni.pdf"


This very short batch file was helpful in creating an archive of over 200,000 pages of the defunct Open Minds Forum website (which I've made freely available online, with permission of its former owner):
isaackoiup.blogspot.com...



I think that a similar loop could be used to obtain each page in the (sequentially numbered) webpages in a long ATS thread - perhaps combined with allowing the user to specify the relevant thread number and the number of pages in the thread (unless the latter could be replaced with some sort of error checking that would terminate the batch file upon a URL being reached which does not exist, e.g. when the batch file seeks to obtain page 11 of a 10 page ATS thread).

So, each URL for a thread on ATS contains a thread number and ends with "pg1" then "pg2" then "pg3" etc. See, for example, one of my multi-page threads at:
www.abovetopsecret.com...

Presumably a batch file could seek (I'd guess in just a few lines...):
(1) input of the thread number (presumably using the set /p command to set a variable and ask for the input)
(2) input of the highest page number (again presumably using the set /p command to set a variable and ask for the input),
(3) then loop through the sequential URLs for that thread from 1 up to the highest number, using wkhtmltopdf to get each URL and save it to the same PDF file (and ideally name the file based on the ATS thread number). This is the step in relation to which I'm particularly unsure of the relevant syntax.
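A rough Python sketch of the three steps above, for comparison with the batch approach. It assumes the URL pattern "https://www.abovetopsecret.com/forum/thread<number>/pg<page>" and the default Windows install path of wkhtmltopdf; the thread number 123456 and the function names are purely illustrative. This version saves one PDF per page rather than a single file.

```python
import subprocess

# Assumption: default install location of wkhtmltopdf on Windows.
WKHTMLTOPDF = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"


def page_urls(thread_number, last_page):
    """Return the URL of every page in a thread, from pg1 up to pg<last_page>."""
    base = f"https://www.abovetopsecret.com/forum/thread{thread_number}"
    # Thread pages are numbered from 1, so the range starts at 1, not 0.
    return [f"{base}/pg{page}" for page in range(1, last_page + 1)]


def page_commands(thread_number, last_page, exe=WKHTMLTOPDF):
    """Build one wkhtmltopdf command per page; each page gets its own PDF."""
    commands = []
    for page, url in enumerate(page_urls(thread_number, last_page), start=1):
        commands.append([exe, url, f"thread{thread_number}_{page}.pdf"])
    return commands


def convert_thread(thread_number, last_page):
    """Run the commands one after another, stopping at the first failure."""
    for command in page_commands(thread_number, last_page):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call convert_thread() to actually run them.
    for command in page_commands(123456, 3):
        print(" ".join(command))
```

The two inputs could be collected with input() in place of set /p, and convert_thread() would then do the downloading.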

For ease of reference, instructions for the usage of wkhtmltopdf are on the wkhtmltopdf.org website at the link below.
wkhtmltopdf.org...

Any help would be most welcome...



posted on Sep, 20 2022 @ 12:35 PM
a reply to: IsaacKoi

I am so sorry I can not offer any technical support but I do offer my gratitude to your work and amazing threads on these very forums and I hope someone can help you ensure all that work and research is able to be stored.

Again, many, many thanks for the UFO threads over the years. You are, in my opinion, a legend and an ATS stalwart.

Good luck



posted on Sep, 20 2022 @ 02:15 PM
a reply to: IsaacKoi

i cannot help with that script as-is because i think its a waste of both of our time tbh. if i were assigned this task, i would probably use the pdfkit package for python which is a wrapper for wkhtmltopdf and implement an html sanitizer. both of these packages are on pypi

pypi.org...
pypi.org...

combined they will likely produce the best output for you.

additionally i would dump the ats page numbers unless you intend to save each page as its own pdf document for your own indexing purposes. also dont forget that most iterators are zero-based, i.e. page 1 is 0 in an indexed array.

no doubt I could script this but probably not this month as i am buried in my own projects.

eta: did not mean to be rude above. just saying its likely a waste of our time if you do not sanitize the html which i dont think you can do as easily from command line as you could with a python script. if you truly cannot figure it out and cannot get the help you need ill consider doing a one-off for you with no support whatsoever.
edit on 20-9-2022 by drewlander because: clarification

edit on 20-9-2022 by drewlander because: (no reason given)



posted on Sep, 20 2022 @ 02:59 PM

originally posted by: drewlander
i would probably use the pdfkit package for python which is a wrapper for wkhtmltopdf and implement an html sanitizer.


I'll readily defer to your opinion that your preferred solution is more desirable - but unfortunately your preferred solution is beyond me and I'm not sure I'll get anyone to do the work on that, whereas the batch file option is something I can probably figure out myself tomorrow (with liberal use of Google to work out the precise syntax...) with ideally a few pointers on the syntax to limit the amount of Googling I'd need to do. I'd rather have something imperfect than nothing at all.

Basically, I think the batch file will be just something along the lines of the following (please excuse the mix of syntaxes/pseudocode but my putting forward an extremely rough draft may help get this done...):

@echo off
SET /p thread_number = "Please enter the relevant thread number "
SET /p page_end = "Please enter the number of the final page in that thread"

For counter = 1 to page_end do the following:
SET page_url = "https://www.abovetopsecret.com/forum/thread" + thread_number + "/pg" + counter
Do "/wkhtmltopdfbinwkhtmltopdf.exe" page_url ("ATS thread" + thread_number + ".pdf")
Next counter
edit on 20-9-2022 by IsaacKoi because: (no reason given)



posted on Sep, 20 2022 @ 03:13 PM
a reply to: IsaacKoi

If I have any time this evening I will take a shot at scripting it properly against a short multi-page thread and get back to you. I notice you reference python ( my suggestion on this one ) and selenium in your footer. Didn't read much into it but I will later. Usually I would only use selenium scripts to automate logins etc. Kind of curious what you are doing there now too. But I must stay focused.
edit on 20-9-2022 by drewlander because: footer



posted on Sep, 20 2022 @ 03:15 PM

originally posted by: drewlander
a reply to: IsaacKoi

If I have any time this evening I will take a shot at scripting it properly against a short multi-page thread and get back to you.


Cheers. I hope to hear from you.



posted on Sep, 20 2022 @ 05:52 PM
a reply to: IsaacKoi

Someone created software that can do all that for you

www.makeuseof.com...



posted on Sep, 20 2022 @ 09:42 PM
a reply to: IsaacKoi

ats frowns on backslash so you will need to make an assumption in the file path there. This is not clean, but it will iterate starting at 1 to the upper limit of 3 in this case. I could easily parameterize this another day but for one-offs this would work. note that I have this saving each page to a new file but it could be easily modified to append to a single document im sure.

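@echo off
setlocal enabledelayedexpansion
rem for /l takes (start,step,end): loop over pages 1 to 3
for /l %%a in (1,1,3) do (
set "n=%%a"
rem !n:~-2! expands to the last two characters of n
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" www.abovetopsecret.com...!n:~-2! 1318435_!n:~-2!.pdf
)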

this is based on other peoples code since I dont specialize in batch. let me know if you need an adjustment or additional parameter.

also note I just grabbed a random thread to make sure it works. you can swap out your own thread id and page upper limit.

ETA: also I ran this from an elevated prompt. If it does nothing you probably are not running command prompt as administrator.


edit on 20-9-2022 by drewlander because: (no reason given)

edit on 20-9-2022 by drewlander because: clarification

edit on 20-9-2022 by drewlander because: admin prompt



posted on Sep, 21 2022 @ 04:01 AM
It's not a script, but I have a small VB.Net program that does more or less that, so it will be easy to change it to achieve what you want.

Unfortunately, I don't have much free time to do it during the week.

PS: being a VB.Net program, it only works in Windows, at least for the moment.



posted on Sep, 21 2022 @ 06:44 AM

originally posted by: drewlander
This is not clean, but it will iterate starting at 1 to the upper limit of 3 in this case. I could easily parameterize this another day but for one-offs this would work.


Cheers, that's very helpful of you. I was able to run your script and it worked fine to download each page as a new file. I was also able to parameterize it myself (mainly just to show that I don't expect other people to do all the work...).

I'll paste the parameterized version below.


@echo off
setlocal enabledelayedexpansion
set /p thread_number=Please enter the relevant thread number:
set /p page_end=Please enter the number of the final page in the thread:
for /l %%a in (1,1,%page_end%) do (
set "n=%%a"
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" www.abovetopsecret.com...!n:~-2! thread%thread_number%_!n:~-2!.pdf

)

I've now used this code to download some ATS threads as PDFs.

HOWEVER, when trying to understand the syntax of the code you helpfully provided (mainly in an attempt to amend it to create a single PDF for a thread), I came across a webpage at the link below that combines wkhtmltopdf and arrays. That page includes the following (unfortunately in relation to a UNIX bash script, so I can't just use this syntax from the command line or a batch file):

unix.stackexchange.com...



readarray -t a < file
wkhtmltopdf "${a[@]}" all.pdf

readarray reads the file into an array line by line, -t removes the trailing newline.

"${a[@]}" refers to all array elements. This generates a command in the form:

wkhtmltopdf "${a[0]}" "${a[1]}" "${a[2]}" "..." all.pdf


This suggests that (at least with Unix and presumably other scripting languages...) there may be a very simple solution using parameters for wkhtmltopdf (although I've searched and searched without finding anyone discussing the potential use of parameters to download sequential URLs).

As a bit of a hodge-podge of a potential means of creating a single PDF directly within wkhtmltopdf (which may result in more internal links working than simply merging the separate PDF files) I've also briefly tried using your loop code above in a batch file to create an array and then use the @ symbol in relation to the array as a parameter for wkhtmltopdf (in the same way that the Unix code at the link above purportedly works). I think I may be close to it working, but I simply can't get my head around the syntax for the final line which calls the array. (Of course, if this works then it may be possible to avoid a batch file altogether and just use parameters when calling wkhtmltopdf).

Provided you can all refrain from laughing, I'll paste below my attempt at coding the loop creating an array.

@echo off
setlocal enabledelayedexpansion
set /p thread_number=Please enter the relevant thread number:
set /p page_end=Please enter the number of the final page in the thread:
for /l %%a in (1,1,%page_end%) do (
set "n=%%a"
set list[%n%]=https://www.abovetopsecret.com/forum/thread%thread_number%/pg!n:~-2!
)
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" [%list[@]%] %thread_number%.pdf

I hope it's now just a matter of getting the batch file syntax right in the middle of the final line when referring to all the elements of the array (assuming there is an equivalent to the brief Unix command/parameter above and it's just a matter of getting the syntax right...).
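For comparison, here is a minimal Python sketch of the same idea as the Bash array expansion quoted above: every page URL is passed to a single wkhtmltopdf call, with the output file last. wkhtmltopdf's usage line allows several input pages before the output file, but the thread number, the URL pattern and the function name below are assumptions made for illustration, and this has not been tested against ATS.

```python
import subprocess


def single_pdf_command(urls, output, exe="wkhtmltopdf"):
    """Build one wkhtmltopdf call: each URL is its own argument, output file last."""
    # The * unpacking plays the role of "all elements of the array".
    return [exe, *urls, output]


def run_single_pdf(urls, output, exe="wkhtmltopdf"):
    """Run the combined command; raises CalledProcessError if wkhtmltopdf fails."""
    subprocess.run(single_pdf_command(urls, output, exe), check=True)


thread_number = 123456  # hypothetical thread number
urls = [
    f"https://www.abovetopsecret.com/forum/thread{thread_number}/pg{page}"
    for page in range(1, 4)
]
command = single_pdf_command(urls, f"{thread_number}.pdf")
print(command)  # call run_single_pdf(urls, ...) to actually produce the PDF
```

Passing the arguments as a list avoids quoting problems, which is the part that is awkward to express in a batch file.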


edit on 21-9-2022 by IsaacKoi because: (no reason given)



posted on Sep, 21 2022 @ 06:53 AM

originally posted by: Spacespider
Someone created software that can do all that for you

www.makeuseof.com...


Thanks. That webpage refers to several very useful tools for archiving websites that I've used over the years (particularly HTTrack and wget), but I'm not aware of a way to get them to do what I want here.



posted on Sep, 21 2022 @ 07:49 AM
While the Bash script/command in relation to arrays that I mentioned above suggests that there may well be a method of running wkhtmltopdf in a way which creates a single PDF from the relevant set of html files (probably with working internal links), for now (again, mainly to show that I'm prepared to do some work myself...) I have downloaded the free pdftk toolkit from the link below (which includes a command line interface) and added a couple of lines to the code I posted above. The short additions now use pdftk to merge the separate PDFs and (after creating the merged PDF) delete the separate files.
www.pdflabs.com...

@echo off
setlocal enabledelayedexpansion
set /p thread_number=Please enter the relevant thread number:
set /p page_end=Please enter the number of the final page in the thread:
for /l %%a in (1,1,%page_end%) do (
set "n=%%a"
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" www.abovetopsecret.com...!n:~-2! thread%thread_number%_!n:~-2!.pdf

)
rem Merge only this thread's page files, not every PDF in the folder
pdftk thread%thread_number%_*.pdf cat output "ATS thread%thread_number%.pdf"
del thread*.pdf



This code now creates a single PDF file of an ATS thread (and could easily be adapted to download a thread from other discussion forums or, indeed, any sequential set of URLs).
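One thing to watch when merging with a wildcard: the page order is left to however the file names happen to sort, and a name such as thread123456_10.pdf sorts before thread123456_2.pdf alphabetically. A minimal Python sketch of one way around that, assuming the per-page files follow the thread<number>_<page>.pdf naming used above (the function names are illustrative only):

```python
import re


def page_number(name):
    """Extract the trailing page number from a name like thread123456_12.pdf."""
    match = re.search(r"_(\d+)\.pdf$", name)
    # Names that do not match the pattern are sorted to the front.
    return int(match.group(1)) if match else -1


def sort_pages(names):
    """Sort page files by page number rather than alphabetically."""
    return sorted(names, key=page_number)


def merge_command(page_files, output):
    """Build a pdftk call that lists the pages explicitly, in numeric order."""
    return ["pdftk", *sort_pages(page_files), "cat", "output", output]


files = ["thread123456_10.pdf", "thread123456_2.pdf", "thread123456_1.pdf"]
print(merge_command(files, "ATS thread123456.pdf"))
```

The resulting list could be handed to subprocess.run() in the same way as the wkhtmltopdf commands.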

Thanks again to those that commented in this thread, particularly to Drewlander.



posted on Sep, 21 2022 @ 09:34 AM
a reply to: IsaacKoi

Nice work. Definitely im not a batch expert but as a software engineer i know the part you were unclear about is a tricky one to grasp sometimes - declaring a variable outside a loop then changing and using it within. With 20 years of experience even I have had my battles with these situations but usually its more complicated like a recursive iterator iterator (multidimensional) of unknown depth to programmatically define a variable for an xpath search and replace in an xml document using domdocument and xpath->query or something. I wish all my work was on this level some days but i simultaneously acknowledge that I would get bored very quickly.

BOTTOM LINE: You did just fine. I appreciate you making attempt to learn and understand what you are doing instead of just asking someone to code for you. If you get stuck again lmk.



posted on Sep, 21 2022 @ 11:09 AM
a reply to: ArMaP

I would have preferred to do this in literally anything other than fake-dos. I even considered coding a PS1 ( powershell ). After much thought however I resolved that I will supply the batch modification to achieve the main objective, and suggest that python is a better solution if for no other reason than it is cross-platform compatible. Given we now have Windows Sub-system for Linux ( WSL ) it would have probably also been better to write as bash script and just provide instruction to install WSL and any dependencies.

btw, vbs is the second scripting language I learned after USCBlogo ( basically you can call it lisp for all intents and purposes ) back in grade school. I never use it for anything practical these days though because its not as friendly across multiple platforms.



posted on Sep, 21 2022 @ 05:41 PM
a reply to: drewlander

Cant edit after 4 hours. UCB Logo I meant. First language I learned. I wouldnt recommend it as a starter language today tho. Python is arguably the best language for beginners today, in my humble opinion.



posted on Sep, 23 2022 @ 01:28 PM

originally posted by: drewlander
Python is arguably the best language for beginners today, in my humble opinion.


I've been learning a tiny bit of Python during the last couple of years, but batch files are more within my comfort zone (albeit fairly limited and very rusty). I'd possibly still remember how to do a tiny bit of coding using PASCAL or BBC BASIC...



Every little project like this helps me learn something new for the next one. For example, I may revisit the issue in my footer that I posted a while back in the light of a bit of recent reading. (That reading resulted in my coming across something new to me which probably is the basis of a solution. Don't laugh, but I don't think I'd come across the terms "piping" and "the pipe" before. I think piping may even enable me to solve that problem myself, possibly even just with a few lines in another small batch file...).



posted on Sep, 25 2022 @ 07:59 AM
Well, my tests didn't give the result I was expecting, so ignore me, at least for now.



posted on Sep, 25 2022 @ 11:42 AM

originally posted by: ArMaP
Well, my tests didn't give the result I was expecting, so ignore me, at least for now.


The main thing I'd like from you ArMaP is some comfort that this work isn't going to annoy the moderators here on ATS...

I don't intend to duplicate all of ATS, but I would like to convert at least some threads to PDF for archiving purposes (and, ideally, share the resulting PDFs online for the few people that like to have PDFs regarding UFOs available on their hard drives for searching etc).



posted on Sep, 25 2022 @ 11:50 AM
I'm pretty happy with how this is shaping up... I now have a working version of an improved solution.

I've managed to modify the code I posted above which asked for two variables to be input (a thread number and the number of pages).

The new version reads from a text/CSV file which lists a few details for various threads (in the format : thread number, number of pages, thread title), e.g.:
197741,2,Stargates are real
479045,3,UFO releases intelligent moving spheres

(I've entered low numbers of pages just as a test. Those threads are actually much longer. I've also not entered the full thread names).

So, it should be possible to fairly quickly (and possibly with others...) create a list of details for some of the top threads on ATS, e.g. using the function on ATS to sort the list of UFO forum threads by number of flags (i.e. click on the link below):
www.abovetopsecret.com...

You'll see that I've taken the top 2 results in the very brief list above. It would probably only take a few more minutes to do the top 10 threads (or possibly top 50 or 100...).

The new code is:


@echo off
setlocal enabledelayedexpansion

for /f "usebackq tokens=1-3 delims=," %%A in ("thread_details.csv") do (
echo %%A %%B %%C
rem %%A = thread number, %%B = number of final page, %%C = thread name
for /l %%a in (1,1,%%B) do (
set "n=%%a"
"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe" www.abovetopsecret.com...!n:~-2! thread%%A_!n:~-2!.pdf

)
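rem Merge only this thread's page files, so earlier merged PDFs in the folder are not pulled in
pdftk thread%%A_*.pdf cat output "ATS thread%%A - %%C.pdf"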
del thread*.pdf
)



The output of this code is a set of PDF files of ATS threads which appear in a folder with names like this:
ATS thread197741 - Stargates are real.pdf
ATS thread479045 - UFO releases intelligent moving spheres.pdf
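Two details of the list format may matter once full thread titles are entered: with tokens=1-3 and a comma delimiter, a title that itself contains a comma is cut off at that comma, and Windows does not allow characters such as ? or : in file names. A minimal Python sketch of reading the same "thread number, number of pages, thread title" lines with both points handled (the function names and the underscore replacement are my own choices, not part of the batch file above):

```python
import re


def parse_thread_details(lines):
    """Yield (thread_number, last_page, title) from 'number,pages,title' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first two commas only, so commas in the title survive.
        number, pages, title = line.split(",", 2)
        yield int(number), int(pages), title.strip()


def output_name(thread_number, title):
    """Build the merged PDF name, replacing characters Windows forbids in file names."""
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title)
    return f"ATS thread{thread_number} - {safe_title}.pdf"


sample = [
    "197741,2,Stargates are real",
    "479045,3,UFO releases intelligent moving spheres",
]
for number, pages, title in parse_thread_details(sample):
    print(pages, output_name(number, title))
```

In practice the lines would come from open("thread_details.csv") rather than the sample list.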


(I'll probably post a few complete threads to the free online UFO archive of PDF material I've been developing during the last decade, together with the batch file and existing text/csv file. If there is interest in a bit of collaboration, I could post an initial list and examples in - say - the UFO forum to see what other threads members may want converted)






edit on 25-9-2022 by IsaacKoi because: (no reason given)

edit on 25-9-2022 by IsaacKoi because: (no reason given)



posted on Sep, 29 2022 @ 06:57 PM
a reply to: IsaacKoi

Looks good to me as long as the website code is trustworthy.

Have you ever been so lucky as to receive an email advising you have missed a package you were not expecting? Or maybe that your banking credentials have been compromised and you need to follow a link to change your password immediately? Often times these scams embed a script into a PDF document that an unsuspecting person launches from their desktop. This gives bad actors elevated access to your computer.

I strongly recommend you look at the conversion options such as --disable-javascript as a modest attempt to mitigate malicious scripting. If the document converts to your satisfaction while dumping extraneous and potentially unsafe code you win twice. Once in risk mitigation and again by reducing the overall size of the document.
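As a small illustration of that suggestion, the option can simply be placed ahead of the URL in the command. This Python sketch only builds the argument list; the function name and the example URL are placeholders, and whether the pages still render acceptably without JavaScript would need checking on a real thread.

```python
def safer_command(url, output, exe="wkhtmltopdf"):
    """Build a wkhtmltopdf call with JavaScript execution switched off."""
    return [exe, "--disable-javascript", url, output]


print(safer_command("https://example.com/", "example.pdf"))
```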

If you need any more help msg me cuz i dont check historical threads often. If you cannot reach me try mandroid cuz he knows how to reach me outside of the site.

Also it looks like you are famous. 2 years ago anyway. Vice had a big writeup on you for archiving 50,00 hours of UFO podcast. (Threw your handle in a google search just out of curiosity)



