Grimoire-Command.es

GNU+Linux command memo

doc_crawler: download all the PDFs of a website


$ doc_crawler.py http://a.com > url.lst (1)
$ doc_crawler.py --download-files url.lst (2)
$ doc_crawler.py --download-file http://a.com/file.txt (3)
$ doc_crawler.py --accept=jpe?g$ --download --single-page --wait=3 https://a.com/a_page (4)
1 Create a url.lst file containing the links to all the PDF, ODT, DOC, ZIP… files found while recursively exploring the pointed website
2 Download all the listed files (as a second step, after you have checked that the list is correct; see the sketch after this list)
3 Download files one by one, to retry those that failed, for instance
4 Download all the photos from a single-page web gallery; every photo found is downloaded directly on the fly, and the random wait between two requests never exceeds 3 seconds
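
Between steps (1) and (2), the list can be reviewed and trimmed with standard tools before anything is fetched. A minimal sketch; the pdf.lst file name and the grep pattern are illustrative, not part of doc_crawler:

$ grep -i '\.pdf$' url.lst > pdf.lst    # keep only the links ending in .pdf
$ less pdf.lst                          # eyeball the result before downloading
$ doc_crawler.py --download-files pdf.lst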

doc_crawler.py can be downloaded here, or installed via PyPI.
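
For a PyPI install, something along these lines should work; the package name doc_crawler and the python3 -m module invocation are assumptions based on common packaging conventions, not confirmed by this memo:

$ pip install doc_crawler                           # assumed package name on PyPI
$ python3 -m doc_crawler http://a.com > url.lst     # assumed module entry point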