<p>Even though Rho's answer seems very good, I thought I'd share how I got scrapy working with Django models (aka the Django ORM) <em>without</em> a full-blown Django project, since the question only states the use of a "Django database". I also do not use DjangoItem.</p>
<p>The following works with Scrapy 0.18.2 and Django 1.5.2. My scrapy project is called <code>scrapping</code> in the following.</p>
<ol>
<li><p>Add the following to your scrapy <code>settings.py</code> file:</p>
<pre><code>from django.conf import settings as d_settings

d_settings.configure(
    DATABASES={
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'db_name',
            'USER': 'db_user',
            'PASSWORD': 'my_password',
            'HOST': 'localhost',
            'PORT': '',
        }
    },
    INSTALLED_APPS=(
        'scrapping',
    )
)
</code></pre></li>
<li><p>Create a <code>manage.py</code> file in the same folder as your <code>scrapy.cfg</code>. This file is not needed when you run the spider itself, but it is very convenient for setting up the database. So here we go:</p>
<pre><code>#!/usr/bin/env python
import os
import sys

if __name__ == "__main__":
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scrapping.settings")

    from django.core.management import execute_from_command_line

    execute_from_command_line(sys.argv)
</code></pre>
<p>That's the entire content of <code>manage.py</code>. It is pretty much exactly the stock <code>manage.py</code> file you get after running <code>django-admin startproject myweb</code>, except that the <code>DJANGO_SETTINGS_MODULE</code> line points to your scrapy settings file. Admittedly, using both <code>DJANGO_SETTINGS_MODULE</code> and <code>settings.configure</code> seems a bit odd, but it works for the one <code>manage.py</code> command I need: <code>$ python ./manage.py syncdb</code>.</p></li>
<li><p>Your <code>models.py</code>: it should be placed in your scrapy project folder (i.e. <code>scrapping.models</code>). After creating that file you should be able to run <code>$ python ./manage.py syncdb</code>. It may look like this:</p>
<pre><code>from django.db import models


class MyModel(models.Model):
    title = models.CharField(max_length=255)
    description = models.TextField()
    url = models.URLField(max_length=255, unique=True)
</code></pre></li>
<li><p>Your <code>items.py</code> and <code>pipelines.py</code>: I used to use DjangoItem as described in Rho's answer, but I ran into trouble with it when running many crawls in parallel with scrapyd and PostgreSQL. The exception <code>max_locks_per_transaction</code> was thrown at some point, breaking all the running crawls. Furthermore, I did not figure out how to properly roll back a failed <code>item.save()</code> in the pipeline. Long story short, I ended up not using DjangoItem at all, which solved all my problems. Here is how.</p>
<p><code>items.py</code>:</p>
<pre><code>from scrapy.item import Item, Field


class MyItem(Item):
    title = Field()
    description = Field()
    url = Field()
</code></pre>
<p>Note that the fields need to have the same names as in the model if you want to unpack them conveniently as in the next step!</p>
<p><code>pipelines.py</code>:</p>
<pre><code>from django.db import transaction

from models import MyModel


class Django_pipeline(object):
    def process_item(self, item, spider):
        # Save each item in its own transaction so a failed save is rolled back.
        with transaction.commit_on_success():
            scraps = MyModel(**item)
            scraps.save()
        return item
</code></pre>
<p>As mentioned above, if you named all your item fields as you did in your <code>models.py</code> file, you can use <code>**item</code> to unpack all the fields when creating your MyModel object.</p></li>
</ol>
<p>That's it!</p>
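<p>To see why the item field names have to match the model field names, here is a minimal sketch of the <code>**item</code> unpacking that runs without Scrapy or Django. <code>StandInItem</code> and <code>StandInModel</code> are hypothetical stand-ins invented for this illustration, not part of either library. They only mimic the two properties that matter here: a scrapy item behaves like a dict, and a Django model constructor rejects keyword arguments it does not know.</p>

```python
# Stand-ins only: no Scrapy or Django is needed to run this sketch.


class StandInItem(dict):
    """Hypothetical stand-in for a scrapy Item (which is dict-like)."""


class StandInModel(object):
    """Hypothetical stand-in for MyModel: accepts only its declared fields."""

    field_names = ("title", "description", "url")

    def __init__(self, **kwargs):
        unknown = set(kwargs) - set(self.field_names)
        if unknown:
            # A Django model constructor also raises TypeError here.
            raise TypeError("unexpected fields: %s" % ", ".join(sorted(unknown)))
        for name in self.field_names:
            setattr(self, name, kwargs.get(name))


# Matching names: every item key becomes a keyword argument of the model.
item = StandInItem(title="Example", description="A page", url="http://example.com/")
record = StandInModel(**item)

# Mismatched name ("link" instead of "url"): the unpacking fails.
mismatched = StandInItem(title="Example", link="http://example.com/")
try:
    StandInModel(**mismatched)
    failed = False
except TypeError:
    failed = True
```

<p>With the real classes, the same mismatch would surface inside <code>process_item</code> when <code>MyModel(**item)</code> is called.</p>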