GNU+Linux command memo

doc_crawler: download all the PDFs of a website


$ doc_crawler.py http://… > url.lst (1)
$ doc_crawler.py --download-files url.lst (2)
$ doc_crawler.py --download-file http://… (3)
$ doc_crawler.py --accept=jpe?g$ --download --single-page --wait=3 http://… (4)
1 Create a url.lst file containing the links to all the PDF, ODT, DOC, ZIP… files found while recursively exploring the given website
2 Download all the listed files (as a second step, after you have checked that the list is correct)
3 Download files one by one, for instance those that failed during step (2)
4 Download all the photos of a single-page web gallery; every matching photo is downloaded on the fly, and the random wait between two requests never exceeds 3 seconds

doc_crawler can be installed via PyPI.
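To make step (1) concrete, here is a minimal, hypothetical sketch in Python of the core idea behind building url.lst: parsing an HTML page and collecting the links whose extension marks them as documents. It uses only the standard library and is not doc_crawler's actual implementation, which additionally follows in-site links recursively, waits between requests, and handles more cases.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Extensions treated as "documents" in this sketch (assumption, not
# doc_crawler's exact list).
DOC_RE = re.compile(r"\.(pdf|odt|docx?|zip)$", re.IGNORECASE)

class LinkCollector(HTMLParser):
    """Collect absolute URLs of document links found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and DOC_RE.search(value):
                # Resolve relative links against the page's URL
                self.found.append(urljoin(self.base_url, value))

def collect_doc_links(html, base_url):
    """Return the list of document URLs linked from the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.found

page = '<a href="/files/report.pdf">report</a> <a href="/about.html">about</a>'
print(collect_doc_links(page, "http://example.com/index.html"))
# → ['http://example.com/files/report.pdf']
```

Writing each collected URL on its own line reproduces the url.lst format consumed by --download-files in step (2).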