Download all the PDF files of a website
$ doc_crawler.py http://a.com > url.lst (1)
$ doc_crawler.py --download-files url.lst (2)
$ doc_crawler.py --download-file http://a.com/file.txt (3)
$ doc_crawler.py --accept=jpe?g$ --download --single-page --wait=3 https://a.com/a_page (4)
(1) Create a url.lst file containing the links to all the PDF, ODT, DOC, ZIP… files found while recursively exploring the given website (a minimal sketch of this crawl follows the list).
(2) Download all the listed files (later, once you have checked that the list is correct).
(3) Download files one by one, for instance those whose download failed.
(4) Download all the photos of a single-page web gallery; every photo found is downloaded on the fly, and the random wait between two requests never exceeds 3 seconds.
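Under the hood, step (1) amounts to a classic crawl-and-collect loop: fetch a page, harvest its links, print the ones that look like documents, and queue the same-site pages for later. The following is a minimal Python sketch of that idea, not doc_crawler's actual implementation; the DOC_RE pattern, the LinkParser class, and the MAX_WAIT cap are illustrative assumptions, and the random pause mirrors the --wait behaviour of step (4).

```python
# Hypothetical sketch of the crawl-and-collect idea behind step (1).
# NOT doc_crawler's real code: DOC_RE, LinkParser and MAX_WAIT are assumptions.
import random
import re
import sys
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

DOC_RE = re.compile(r'\.(pdf|odt|docx?|zip)$', re.I)  # assumed file-type filter
MAX_WAIT = 3  # cap on the random pause between requests, as in step (4)

class LinkParser(HTMLParser):
    """Collect every href found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def crawl(start_url):
    """Explore start_url recursively: print document links, follow page links."""
    site = urlparse(start_url).netloc
    to_visit, seen = [start_url], set()
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode('utf-8', errors='replace')
        except OSError:
            continue  # unreachable page: skip it
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if DOC_RE.search(urlparse(link).path):
                print(link)            # step (1): list the document, don't fetch it
            elif urlparse(link).netloc == site:
                to_visit.append(link)  # stay on the same site
        time.sleep(random.uniform(0, MAX_WAIT))  # polite random wait, step (4)

if __name__ == '__main__':
    crawl(sys.argv[1])
```

Run as `python crawl_sketch.py http://a.com > url.lst` to mirror the first command above: the crawl only lists the document URLs, so you can review url.lst before launching the bulk download of step (2).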