fulmar¶
Fulmar is a distributed crawler system.
By using non-blocking network I/O, Fulmar can handle hundreds of open connections at the same time, so you can extract the data you need from websites quickly and simply.
Some features you may want to know about:
- Scripts are written in Python
- Task scheduling (crontab) and priority
- Cookie persistence
- Redis as the message queue
- MongoDB as the default database at present
- Rate limiting of requests to a given website
- Distributed architecture
- Crawling of JavaScript pages
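To make the per-website rate limiting in the list above concrete, here is a minimal token-bucket sketch. It is an illustration of the general technique only, not Fulmar's actual implementation; the class name `TokenBucket`, its parameters, and the `buckets` dictionary keyed by host name are all hypothetical.

```python
import time


class TokenBucket(object):
    """Hypothetical limiter: about `rate` requests per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.time):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable clock, handy for testing
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if a request may be sent now, False otherwise."""
        now = self.clock()
        # Refill according to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per website (assumed keying by host name, for illustration).
buckets = {"www.baidu.com": TokenBucket(rate=2, burst=3)}
```

A crawler built this way would call `buckets[host].try_acquire()` before each request and put the task back in the queue when it returns `False`.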
Script example¶
from fulmar.base_spider import BaseSpider


class Handler(BaseSpider):

    def on_start(self):
        # Schedule the first request; its response is passed to parse_and_save.
        self.crawl('http://www.baidu.com/', callback=self.parse_and_save)

    def parse_and_save(self, response):
        # Return the page URL and the text of its <title> element.
        return {
            "url": response.url,
            "title": response.page_lxml.xpath('//title/text()')[0]}
Save the code above in a new file called baidu_spider.py and run:
fulmar start_project baidu_spider.py
If you have Redis installed, you will see:
Successfully start the project, project name: "baidu_spider".
Finally, start Fulmar:
fulmar all
Installation¶
Automatic installation:
pip install fulmar
Fulmar is listed in PyPI and can be installed with pip or easy_install.
Manual installation: download the tarball, then:
tar xvzf fulmar-latest.tar.gz
cd fulmar-latest
python setup.py build
sudo python setup.py install
The Fulmar source code is hosted on GitHub.
Prerequisites: Fulmar runs on Python 2.7 and 3.3+. For Python 2, version 2.7.9 or newer is strongly recommended for the improved SSL support.
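The prerequisites above can be checked before installing. The following is a small sketch; the function name `check_python` and its return values are hypothetical and are not part of Fulmar.

```python
import sys


def check_python(version=None):
    """Classify a (major, minor, micro) version against the stated prerequisites.

    Returns "ok", "warn" (Python 2.7 older than 2.7.9), or "unsupported".
    """
    version = tuple(sys.version_info[:3] if version is None else version)
    if version >= (3, 3):
        return "ok"
    if version[:2] == (2, 7):
        # 2.7.9 or newer is recommended for the improved SSL support.
        return "ok" if version >= (2, 7, 9) else "warn"
    return "unsupported"


if __name__ == "__main__":
    print(check_python())
```

Calling `check_python()` with no argument inspects the running interpreter; passing a tuple such as `(2, 7, 8)` classifies that version instead.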
Documentation¶
This documentation is also available in PDF and Epub formats.