How do I extract all the PDF links on a website?

This is a bit off topic, but I hope you can help me. I have found a website full of articles I need, but those are mixed in with a lot of useless files (mainly jpgs).

I would like to know if there is a way to find (not download) all the PDFs on the server in order to make a list of links. Basically, I want to filter out everything that is not a PDF, so I can get a better view of what to download and what not to.

Overview

OK, here you go. This is a programmatic solution in the form of a script:

    #!/bin/bash

    # NAME:         pdflinkextractor
    # AUTHOR:       Glutanimate (http://askubuntu.com/users/81372/), 2013
    # LICENSE:      GNU GPL v2
    # DEPENDENCIES: wget lynx
    # DESCRIPTION:  extracts PDF links from websites and dumps them to the stdout and as a textfile
    #               only works for links pointing to files with the ".pdf" extension
    #
    # USAGE:        pdflinkextractor "www.website.com"

    WEBSITE="$1"

    echo "Getting link list..."

    lynx -cache=0 -dump -listonly "$WEBSITE" | grep ".*\.pdf$" | awk '{print $2}' | tee pdflinks.txt

    # OPTIONAL
    #
    # DOWNLOAD PDF FILES
    #
    #echo "Downloading..."
    #wget -P pdflinkextractor_files/ -i pdflinks.txt

Installation

You will need to have wget and lynx installed:

    sudo apt-get install wget lynx

Usage

The script will fetch a list of all .pdf files on the website and dump it to the command-line output and to a text file (pdflinks.txt) in the working directory. If you uncomment the "optional" wget command at the end, the script will go on to download all the files into a new directory.

    $ ./pdflinkextractor http://www.pdfscripting.com/public/Free-Sample-PDF-Files-with-scripts.cfm
    Getting link list...
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ModifySubmit_Example.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/DynamicEmail_XFAForm_V2.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcquireMenuItemNames.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/BouncingButton.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JavaScriptClock.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/Matrix2DOperations.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/RobotArm_3Ddemo2.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/SimpleFormCalculations.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/TheFlyv3_EN4Rdr.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ImExportAttachSample.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_BasicToggle.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcroForm_ToggleButton_Sample.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/AcorXFA_BasicToggle.pdf
    http://www.pdfscripting.com/public/FreeStuff/PDFSamples/ConditionalCalcScripts.pdf
    Downloading...
    --2013-12-24 13:31:25--  http://www.pdfscripting.com/public/FreeStuff/PDFSamples/JSPopupCalendar.pdf
    Resolving www.pdfscripting.com (www.pdfscripting.com)... 74.200.211.194
    Connecting to www.pdfscripting.com (www.pdfscripting.com)|74.200.211.194|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 176008 (172K) [application/pdf]
    Saving to: `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf'

    100%[===========================================================================================================================================================================>] 176.008      120K/s   in 1,4s

    2013-12-24 13:31:29 (120 KB/s) - `/Downloads/pdflinkextractor_files/JSPopupCalendar.pdf' saved [176008/176008]

    ...
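If you would rather not install lynx, the same filtering idea can be sketched with grep and sed alone. This is a minimal sketch, not a replacement for the script above: it assumes you have already saved the page (here as a hypothetical page.html) and that the PDF links appear as plain href="...pdf" attributes in the raw HTML, which will miss links built by JavaScript or written with single quotes.

```shell
# Pull out href targets ending in .pdf from a saved HTML page,
# strip the surrounding attribute syntax, and de-duplicate.
grep -oE 'href="[^"]+\.pdf"' page.html | sed 's/^href="//; s/"$//' | sort -u
```

Note that relative links will come out relative; the lynx -dump -listonly approach above resolves them to absolute URLs for you.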

A simple JavaScript snippet can solve this. (Note: I assume that all PDF links end in ".pdf".)

Open your browser's JavaScript console, copy the following code and paste it into the console, and you are done!

    // get all link elements
    var link_elements = document.querySelectorAll(":link");

    // extract all URIs, skipping duplicates
    var link_uris = [];
    for (var i = 0; i < link_elements.length; i++) {
        // skip links we have already seen
        if (link_uris.indexOf(link_elements[i].href) !== -1)
            continue;
        link_uris.push(link_elements[i].href);
    }

    // keep only the links containing the ".pdf" string
    var link_pdfs = link_uris.filter(function (lu) { return lu.indexOf(".pdf") !== -1; });

    // print all pdf links
    for (var i = 0; i < link_pdfs.length; i++)
        console.log(link_pdfs[i]);