Scrapy (Screen Scraping and Web Crawling Framework)


Scrapy is an application framework for crawling web sites and extracting structured data. It finds applications in data mining, information processing and historical archiving. It offers far more features than most web crawlers out there, which is why it earns a place on PenTestIT. These are its current features:

  • Built-in support for selecting and extracting data from HTML and XML sources (a minimal spider sketch follows this list)
  • Built-in support for cleaning and sanitizing the scraped data using a collection of reusable filters (called Item Loaders) shared between all the spiders.
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
  • A media pipeline for automatically downloading images (or any other media) associated with the scraped items
  • Support for extending Scrapy by plugging your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in middlewares and extensions for:
    • cookies and session handling
    • HTTP compression
    • HTTP authentication
    • HTTP cache
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
    • and more
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Extensible stats collection for multiple spider metrics, useful for monitoring the performance of your spiders and detecting when they get broken
  • An interactive shell console for trying XPaths, very useful for writing and debugging your spiders
  • A system service designed to ease the deployment and running of your spiders in production
  • A built-in Web service for monitoring and controlling your bot
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler
  • Logging facility that you can hook on to for catching errors during the scraping process.
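
To give a concrete feel for the selection and extraction support, here is a minimal spider sketch roughly in the style of the Scrapy 0.x API described in this post (BaseSpider and HtmlXPathSelector); the domain, item fields and XPath expressions are made-up placeholders, not something taken from the original announcement.

# Minimal sketch of a Scrapy 0.x-style spider; domain, fields and XPaths are illustrative placeholders
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class ProductItem(Item):
    # Hypothetical fields for the data we want to collect
    name = Field()
    price = Field()
    url = Field()

class RetailerSpider(BaseSpider):
    name = "retailer"
    allowed_domains = ["example.com"]              # placeholder domain
    start_urls = ["http://example.com/products"]   # placeholder start page

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One item per product listing; the XPaths are illustrative only
        for product in hxs.select("//div[@class='product']"):
            item = ProductItem()
            item["name"] = product.select(".//h2/text()").extract()
            item["price"] = product.select(".//span[@class='price']/text()").extract()
            item["url"] = response.url
            yield item

The same XPath expressions can first be tried out in the interactive shell console mentioned above before they are hard-coded into a spider.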
In addition to the above features, Scrapy runs on Linux, Windows, Mac and BSD operating systems. It is written in Python with ease of use and simplicity in mind. In fact, according to the author, Scrapy is used in production crawlers that completely scrape more than 500 retailer sites daily, all on one server! To use it properly on your system, you need to satisfy the following dependencies:
  • Python 2.5, 2.6, 2.7 (3.x is not yet supported)
  • Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of a Twisted bug)
  • lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
  • simplejson (not required if using Python 2.6 or above)
  • pyopenssl (for HTTPS support. Optional, but highly recommended)
You can easily configure Scrapy via the scrapy.cfg configuration file.
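As a rough sketch (the project name and settings module path are placeholders), the scrapy.cfg that sits in a project's root directory typically just points Scrapy at the project's settings module:

# scrapy.cfg -- placed in the project root
[settings]
default = myproject.settings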
Download Scrapy 0.12 (tip.zip/tip.tar.gz) here.
