fulmar

Fulmar is a distributed crawler system.

By using non-blocking network I/O, Fulmar can handle hundreds of open connections at the same time, so you can extract the data you need from websites quickly and simply.
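
To illustrate why non-blocking I/O lets a single process keep many connections open at once, here is a minimal sketch using Python 3's standard asyncio module. It is not Fulmar's implementation: `fake_fetch`, `crawl_all`, `run_demo`, and the example.com URLs are hypothetical names chosen for illustration, and the network request is simulated with a sleep so the example runs offline.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    # Stand-in for a non-blocking HTTP request (assumption: no real network).
    # While this coroutine waits, the event loop can service other requests.
    await asyncio.sleep(delay)
    return url, 200


async def crawl_all(urls):
    # Start every request at once and collect the results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def run_demo(n=200):
    # Build n hypothetical URLs and time how long fetching all of them takes.
    urls = ["http://example.com/page/%d" % i for i in range(n)]
    started = time.monotonic()
    results = asyncio.run(crawl_all(urls))
    elapsed = time.monotonic() - started
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print("%d responses in %.2fs" % (len(results), elapsed))
```

Because the waits overlap, the total time stays close to the delay of a single request rather than growing with the number of requests.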

Key features:

Scripts are written in Python
Crontab-style task scheduling and task priorities
Cookie persistence
Redis as the message queue
MongoDB as the default database (at present)
Per-website request rate limiting
Distributed architecture
Crawling of JavaScript pages
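
As a rough illustration of what per-website rate limiting involves, the sketch below spaces out requests to the same host by a minimum interval. It is an assumption-laden example in plain Python 3, not Fulmar's own mechanism: the class `HostRateLimiter`, its `min_interval` and `clock` parameters, and the `wait_time` method are all hypothetical.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Allow at most one request per `min_interval` seconds per host."""

    def __init__(self, min_interval, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the limiter can be tested
        self.next_allowed = {}  # host -> earliest start time of next request

    def wait_time(self, url):
        """Reserve a slot for `url` and return the seconds to wait first."""
        host = urlparse(url).netloc
        now = self.clock()
        # The request starts now, or when the host's previous slot ends.
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_interval
        return start - now


if __name__ == "__main__":
    limiter = HostRateLimiter(min_interval=2.0)
    for url in ["http://a.example/1", "http://a.example/2", "http://b.example/"]:
        print(url, "wait %.1fs" % limiter.wait_time(url))
```

A caller would sleep for the returned number of seconds before sending the request. Each host has its own schedule, so a slow limit on one website does not delay requests to another.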

Script example

from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

    def on_start(self):
        # Queue the first request; its response is passed to parse_and_save.
        self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

    def parse_and_save(self, response):
        # Return the page URL and the text of its <title> element.
        return {
            "url": response.url,
            "title": response.page_lxml.xpath('//title/text()')[0]}

Save the code above in a new file called baidu_spider.py and run:

fulmar start_project baidu_spider.py

If Redis is installed, you will see:

Successfully start the project, project name: "baidu_spider".

Finally, start Fulmar:

fulmar all

Installation

Automatic installation:

pip install fulmar

Fulmar is listed on PyPI and can be installed with pip or easy_install.

Manual installation: download the tarball, then:

tar xvzf fulmar-latest.tar.gz
cd fulmar-latest
python setup.py build
sudo python setup.py install

The Fulmar source code is hosted on GitHub.

Prerequisites: Fulmar runs on Python 2.7 and 3.3+. For Python 2, version 2.7.9 or newer is strongly recommended for its improved SSL support.
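
The version requirements above can be checked before installing with a few lines of Python. This is only a sketch: the function `check_python` and its return labels are hypothetical names, not part of Fulmar.

```python
import sys


def check_python(version=None):
    """Classify an interpreter version against the stated prerequisites.

    Returns "supported", "supported-old-ssl" (Python 2.7 older than 2.7.9),
    or "unsupported". Defaults to the running interpreter.
    """
    v = tuple((version or sys.version_info)[:3])
    if v >= (3, 3):
        return "supported"
    if (2, 7) <= v < (3, 0):
        # 2.7.9+ is recommended for its improved SSL support.
        return "supported" if v >= (2, 7, 9) else "supported-old-ssl"
    return "unsupported"


if __name__ == "__main__":
    print(check_python())
```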

Documentation

This documentation is also available in PDF and Epub formats.