API Index

BaseSpider

It is the main class; every spider you write should inherit from it.

Import it in your script file like this:

from fulmar.base_spider import BaseSpider

Then create a new class that inherits from BaseSpider, like this:

class Handler(BaseSpider):

Note that the class does not have to be named Handler.

CrawlRate

It is used to limit the crawl rate. It is a decorator applied to the Handler class.

You can use it like this:

from fulmar.base_spider import BaseSpider, CrawlRate

@CrawlRate(request_number=1, time_period=2)
class Handler(BaseSpider):

   def on_start(self):
      self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

   def parse_and_save(self, response):
      return {
         "url": response.url,
         "title": response.page_lxml.xpath('//title/text()')[0]}

This means one Worker can send at most request_number requests every time_period seconds. Note that this rate limit applies to a single Worker.

So if you start fulmar with n workers, up to request_number * n requests are actually sent every time_period seconds. For example, with @CrawlRate(request_number=1, time_period=2) and 3 workers, up to 3 requests may be sent every 2 seconds.

def on_start(self)

In the Handler class, you should define a method called on_start. It is the entry point of the whole crawl.

def crawl(self, url, **kwargs)

It is used to create a new crawl task. You can use it like this:

self.crawl('http://www.baidu.com/', callback=self.detail_page)

The following parameters can be used.

url

URL for the new request.

method

HTTP method for the new request. Defaults to GET.

params

(optional) Dictionary or bytes to be sent in the query string for the request.

data

(optional) Dictionary or bytes to be sent in the body of the request.

headers

(optional) Dictionary of HTTP headers to be sent with the request.

cookies

(optional) Dictionary of cookies to be sent with the request.

timeout

(optional) How long to wait, in seconds, for the server to send data before giving up. Type: float.
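
The request-shaping parameters above (method, data, headers, cookies and timeout) can be combined in a single call. Below is a minimal sketch of a POST crawl; the URL, field names and values are placeholders, not part of the API:

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

   def on_start(self):
      # A POST request with a body, custom headers, cookies and a 10-second timeout.
      self.crawl('http://www.example.com/login',
                 method='POST',
                 data={'user': 'alice', 'password': 'secret'},
                 headers={'User-Agent': 'fulmar'},
                 cookies={'session': 'abc123'},
                 timeout=10.0,
                 callback=self.parse_and_save)

   def parse_and_save(self, response):
      return {"url": response.url}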

priority

The larger the value, the higher the priority of the request. Type: int.

callback

The method to parse the response.

callback_args

Additional positional arguments passed to the callback. Type: list.

callback_kwargs

Additional keyword arguments passed to the callback. Type: dict.
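
As a sketch of callback_kwargs, assuming (which is not spelled out above) that the extra keyword arguments are passed to the callback after the response:

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

   def on_start(self):
      # Pass extra context to the callback via callback_kwargs.
      self.crawl('http://www.baidu.com/',
                 callback=self.parse_and_save,
                 callback_kwargs={'category': 'homepage'})

   def parse_and_save(self, response, category=None):
      # 'category' comes from callback_kwargs.
      return {
         "url": response.url,
         "category": category,
         "title": response.page_lxml.xpath('//title/text()')[0]}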

taskid

(optional) Unique id to identify the task. Defaults to the SHA1 hash of the URL, so it won't be unique if you request the same URL with different POST parameters.

crawl_at

The time to start the request. It must be a timestamp. Type: int or float.

crawl_later

Starts the request after crawl_later seconds.

crawl_period

Schedules the request to run periodically; it is sent again every crawl_period seconds.
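
A sketch of the three scheduling parameters above; the delays and intervals are only illustrative:

import time

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

   def on_start(self):
      # Start this request 30 seconds from now.
      self.crawl('http://www.baidu.com/', callback=self.parse_and_save,
                 crawl_later=30)

      # Start this request at a specific timestamp (one hour from now).
      self.crawl('http://www.baidu.com/', callback=self.parse_and_save,
                 crawl_at=time.time() + 3600)

      # Re-crawl this URL every 3600 seconds.
      self.crawl('http://www.baidu.com/', callback=self.parse_and_save,
                 crawl_period=3600)

   def parse_and_save(self, response):
      return {"url": response.url}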

crawl_rate

This should be a dict which contains request_number and time_period. Note that time_period is given in seconds and defaults to 1 if not set. E.g.,

crawl_rate={'request_number': 10, 'time_period': 2}

This means the URL can be crawled at most 10 times every 2 seconds.

Type: dict.

allow_redirects

(optional) Whether to follow redirects. Defaults to True. Type: bool.

proxy_host

(optional) HTTP proxy hostname. To use proxies, proxy_host and proxy_port must be set; proxy_username and proxy_password are optional. Type: string.

proxy_port

(optional) HTTP proxy port. Type: int.

proxy_username

(optional) HTTP proxy username. Type: string.

proxy_password

(optional) HTTP proxy password. Type: string.
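
A sketch of crawling through an HTTP proxy; the address and credentials are placeholders, and proxy_username / proxy_password can be left out if the proxy does not require authentication:

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

   def on_start(self):
      # Route the request through an authenticated HTTP proxy.
      self.crawl('http://www.baidu.com/',
                 callback=self.parse_and_save,
                 proxy_host='127.0.0.1',
                 proxy_port=8080,
                 proxy_username='user',
                 proxy_password='pass')

   def parse_and_save(self, response):
      return {"url": response.url}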

fetch_type

(optional) Set to js to enable the JavaScript fetcher. Defaults to None.

js_script

(optional) JavaScript to run before or after the page is loaded. It should be wrapped in a function, e.g. function() { document.write("Hello World!"); }.

js_run_at

(optional) Run JavaScript specified via js_script at document-start or document-end. Defaults to document-end.

js_viewport_width

(optional) Set the width of the viewport used by the JavaScript fetcher during the layout process.

js_viewport_height

(optional) Set the height of the viewport used by the JavaScript fetcher during the layout process.

load_images

(optional) Load images when the JavaScript fetcher is enabled. Defaults to False.

validate_cert

(optional) For HTTPS requests, whether to validate the server's certificate. Defaults to True.
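
Finally, a sketch that puts the JavaScript fetcher options together; the script itself is only an illustration of the required function wrapper:

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

   def on_start(self):
      # Render the page with the JavaScript fetcher before parsing it.
      self.crawl('https://www.baidu.com/',
                 callback=self.parse_and_save,
                 fetch_type='js',
                 js_script='function() { window.scrollTo(0, document.body.scrollHeight); }',
                 js_run_at='document-end',
                 js_viewport_width=1024,
                 js_viewport_height=768,
                 load_images=False,
                 validate_cert=True)

   def parse_and_save(self, response):
      return {
         "url": response.url,
         "title": response.page_lxml.xpath('//title/text()')[0]}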