Quickstart

Installation

  • pip install fulmar

Note: Redis is required, so make sure it is installed and running.
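
You can check that Redis is up and reachable by running:

  • redis-cli ping

It should reply with PONG.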

Note: If you want to save the data you extract from websites, please install MongoDB.

Please install PhantomJS if needed: http://phantomjs.org/build.html

Note: PhantomJS will be enabled only if it is executable via the PATH or the system environment.
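
You can verify this by running:

  • phantomjs --version

If a version number is printed, PhantomJS is on your PATH.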

Run

  • fulmar all

Note: The fulmar all command runs fulmar in all-in-one mode, starting all of its components in threads or subprocesses. For a production environment, please refer to [Deployment](Deployment).

Your First Spider

from fulmar.base_spider import BaseSpider

class Handler(BaseSpider):

    def on_start(self):
        # Entry point: schedule the first URL to crawl.
        self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

    def parse_and_save(self, response):
        # Extract the page URL and title from the response.
        return {
            "url": response.url,
            "title": response.page_lxml.xpath('//title/text()')[0]}

You can save the above code in a new file called baidu_spider.py and run this command in a new console:

fulmar start_project baidu_spider.py

If you have installed Redis, you will get:

Successfully start the project, project name: `baidu_spider`.

In the example:

on_start(self)

It is the entry point of the spider.

self.crawl(url, callback=self.parse_and_save)

It is the most important API here. It will add a new task to be crawled.
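
For example, since crawl takes a URL and a callback, you can chain crawls across pages. A minimal sketch (the index URL and the //a/@href selector are illustrative, not part of fulmar; only the url and callback arguments shown in this guide are assumed):

from fulmar.base_spider import BaseSpider

class Handler(BaseSpider):

    def on_start(self):
        # Start from a (hypothetical) index page.
        self.crawl('http://example.com/', callback=self.parse_index)

    def parse_index(self, response):
        # Schedule a new crawl task for every link on the index page.
        for url in response.page_lxml.xpath('//a/@href'):
            self.crawl(url, callback=self.parse_and_save)

    def parse_and_save(self, response):
        return {"url": response.url,
                "title": response.page_lxml.xpath('//title/text()')[0]}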

parse_and_save(self, response)

It gets a Response object. response.page_lxml is an lxml.html document (built with lxml.html.document_fromstring), which provides an xpath API for selecting the elements you want to extract.

It returns a dict object as the result. The result will be captured into resultdb by default. You can override the on_result(self, result) method to manage the result yourself.
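
For instance, here is a minimal sketch of overriding on_result to append each result to a local JSON-lines file (the file name is made up for illustration; only the on_result(self, result) hook described above is assumed):

import json

from fulmar.base_spider import BaseSpider

class Handler(BaseSpider):

    def on_start(self):
        self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

    def parse_and_save(self, response):
        return {"url": response.url,
                "title": response.page_lxml.xpath('//title/text()')[0]}

    def on_result(self, result):
        # Manage the result yourself instead of the default resultdb capture.
        with open('results.jsonl', 'a') as f:
            f.write(json.dumps(result) + '\n')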

More details you should know:

CrawlRate

It is a decorator for the Handler class, used to limit the crawl rate.

You can use it like this:

from fulmar.base_spider import BaseSpider, CrawlRate

@CrawlRate(request_number=1, time_period=2)
class Handler(BaseSpider):

    def on_start(self):
        self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

    def parse_and_save(self, response):
        return {
            "url": response.url,
            "title": response.page_lxml.xpath('//title/text()')[0]}

This means each worker can send at most request_number requests during every time_period seconds. Note that this rate limit applies to a single worker.

So if you start fulmar with n workers, you actually send up to request_number * n requests during every time_period seconds. For example, with @CrawlRate(request_number=1, time_period=2) and 3 workers, at most 3 requests are sent every 2 seconds in total.