WebMagic website: http://webmagic.io/
Reference: https://www.oschina.net/code/snippet_1397325_35514
1. Implement PageProcessor
import java.util.ArrayList;
import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.utils.UrlUtils;

public class ImgProcessor implements PageProcessor {

    // Regex for the page links the spider should follow
    private String urlPattern;

    private Site site;

    private int key = 0;

    public ImgProcessor() {}

    public ImgProcessor(String startUrl, String urlPattern) {
        this.site = Site.me().setDomain(UrlUtils.getDomain(startUrl));
        this.urlPattern = urlPattern;
    }

    @Override
    public void process(Page page) {
        // Regex for the image URLs to collect from each page
        String imgRegex = "http://mm.howkuai.com/wp-content/uploads/20[0-9]{2}[a-z]/[0-9]{1,4}/[0-9]{1,4}/[0-9]{1,4}.jpg";
        List<String> requests = page.getHtml().links().regex(urlPattern).all();
        // Page title with punctuation, whitespace, and the site name characters stripped
        String imgHostFileName = page.getHtml().xpath("//title/text()").toString().replaceAll("[|\\pP‘’“”\\s(妹子图)]", "");
        List<String> listProcess = page.getHtml().$("div#picture").regex(imgRegex).all();
        // The title is captured here as well; it is pulled out later and used as the file name
        listProcess.add(0, imgHostFileName);
        page.putField("img", listProcess);

        page.addTargetRequests(requests);
    }

    @Override
    public Site getSite() {
        return site;
    }
}
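The title cleanup in process uses a character class, so it strips individual characters rather than a literal phrase. A minimal sketch with plain JDK string handling (no WebMagic needed; the class name TitleCleaner and the sample titles are made up for illustration) isolates that step and adds a null guard, on the assumption that a page may have no title:

```java
// Hypothetical helper, not part of the original post: isolates the title cleanup from process().
final class TitleCleaner {

    // Same expression as in ImgProcessor. Everything inside [...] is a set of single
    // characters: '|', any Unicode punctuation (\pP), curly quotes, whitespace,
    // the two parentheses, and the three characters of the site name.
    static final String STRIP = "[|\\pP‘’“”\\s(妹子图)]";

    static String clean(String title) {
        // A null title would make replaceAll throw a NullPointerException.
        if (title == null) {
            return "";
        }
        return title.replaceAll(STRIP, "");
    }
}
```

For example, TitleCleaner.clean("Sample Title | 妹子图") gives SampleTitle. Because the site name is listed character by character, any of those three characters is also removed when it appears elsewhere in a title.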
2. Implement Pipeline
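The ImgPipeline source is not included in this post. As a rough sketch of the storage step such a pipeline needs, assuming the convention from step 1 (the "img" list holds the cleaned title at index 0, followed by the image URLs), the JDK-only helper below maps that list to target files and writes a stream to disk. The class name ImgStore, its method names, and the <title>/<n>.jpg layout are assumptions for illustration. A real ImgPipeline would implement WebMagic's Pipeline interface, read the list with resultItems.get("img"), and could call something like this from its process method.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; this is NOT the ImgPipeline from the original post.
final class ImgStore {

    private final Path root;

    ImgStore(String fileStorePath) {
        this.root = Paths.get(fileStorePath);
    }

    // img.get(0) is the title, img.get(1..) are image URLs.
    // Returns one target path per URL: <root>/<title>/<n>.jpg, with n starting at 1.
    List<Path> planTargets(List<String> img) {
        List<Path> targets = new ArrayList<>();
        if (img == null || img.size() < 2) {
            return targets; // no list, or a title without any images
        }
        String first = img.get(0);
        String title = (first == null || first.isEmpty()) ? "untitled" : first;
        for (int i = 1; i < img.size(); i++) {
            targets.add(root.resolve(title).resolve(i + ".jpg"));
        }
        return targets;
    }

    // Writes the stream to the target file, creating parent folders as needed
    // and overwriting a file of the same name.
    void save(InputStream in, Path target) throws IOException {
        Files.createDirectories(target.getParent());
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In a pipeline, each image URL could be opened with new java.net.URL(url).openStream() and the resulting stream passed to save together with the matching path from planTargets.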
3. Crawl the images
import us.codecraft.webmagic.Spider;

public class ImgSpiderTest {
    public static void main(String[] args) {
        // The webmagic-data folder must already exist on drive E: and must contain
        // a test folder; otherwise an error is thrown
        String fileStorePath = "E:\\webmagic-data\\test";
        String urlPattern = "http://www.meizitu.com/[a-z]/[0-9]{1,4}.html";
        ImgProcessor imgspider = new ImgProcessor("http://www.meizitu.com/", urlPattern);

        // Demo of collecting images with WebMagic. The site is used only to test
        // the code; please do not crawl it excessively
        Spider.create(imgspider)
                .addUrl("http://www.meizitu.com/")
                .addPipeline(new ImgPipeline(fileStorePath))
                .thread(10) // the thread count can be adjusted here
                .run();
    }
}
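Since the comment in main says the run fails when the storage folder is missing, one option is to create the folder before starting the spider. A minimal sketch using java.nio.file.Files.createDirectories, which creates any missing parent folders and does nothing if the folder already exists (the class name StoreDirs and the method ensure are illustrative, not from the original post):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration.
final class StoreDirs {

    // Creates the folder and any missing parents; a no-op if the folder already exists.
    static Path ensure(String fileStorePath) throws IOException {
        return Files.createDirectories(Paths.get(fileStorePath));
    }
}
```

Calling StoreDirs.ensure(fileStorePath) at the top of main, before Spider.create, would remove the need to create the folders by hand; main would then have to declare or handle IOException.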
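The image pattern in ImgProcessor, "http://mm.howkuai.com/wp-content/uploads/20[0-9]{2}[a-z]/[0-9]{1,4}/[0-9]{1,4}/[0-9]{1,4}.jpg", may change over time. If no images are fetched, check whether it still matches the image URLs on the site.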
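One way to check the pattern is to run it against an image URL copied from the site's current page source. A small sketch with java.util.regex (the class name ImgRegexCheck is made up for illustration, and any sample HTML passed in is your own; the pattern is the one from ImgProcessor):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class ImgRegexCheck {

    // Copied from ImgProcessor; note that it expects a lowercase letter right after the year.
    static final String IMG_REGEX =
            "http://mm.howkuai.com/wp-content/uploads/20[0-9]{2}[a-z]/[0-9]{1,4}/[0-9]{1,4}/[0-9]{1,4}.jpg";

    // Returns every substring of html that the pattern matches.
    static List<String> findImages(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(IMG_REGEX).matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

If findImages returns an empty list for HTML copied from a current picture page, the host or path layout has probably changed and the pattern needs updating.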