
Python OOP Project Organization
<p>I'm fairly new to Python development, and I'm building a larger project for some web scraping. I want to approach this as "Pythonically" as possible, and would appreciate some help with the project structure. Here's how I'm doing it now:</p>
<p>I have a base class for an object whose purpose is to go to a website and parse some specific data on it into its own list, <code>jobs</code>.</p>
<p><strong>minion.py</strong></p>
<pre><code>class minion:
    # Empty getJobs() function to be defined by object pre-instantiation
    def getJobs(self):
        pass

    # Constructor for a minion; user and password are only needed
    # when the site requires authorization
    # Ex: minCity1 = minion('Some city', 'http://portal.com/somewhere', 'user', 'password')
    # or  minCity2 = minion('Some city', 'http://portal.com/somewhere')
    def __init__(self, title, URL, user='', password=''):
        self.title = title
        self.URL = URL
        self.user = user
        self.password = password
        self.jobs = []
        if (user == '' and password == ''):
            self.reqAuth = 0
        else:
            self.reqAuth = 1

    def displayjobs(self):
        for j in self.jobs:
            j.display()
</code></pre>
<p>I'm going to have about 100 different data sources. The way I'm doing it now is to create a separate module for each "minion", which defines (and binds) a more tailored getJobs() function for that object.</p>
<p><strong>Example: minion_City1.py</strong></p>
<pre><code>from minion import minion
from BeautifulSoup import BeautifulSoup
import urllib2
from job import job

# MINION CONFIG
minTitle = 'Some city'
minURL = 'http://www.somewebpage.gov/'

# Here we define a function that will be bound to this object's getJobs function
def getJobs(self):
    page = urllib2.urlopen(self.URL)
    soup = BeautifulSoup(page)

    # For each row
    for tr in soup.findAll('tr'):
        tJob = job()
        span = tr.findAll(['span', 'class="content"'])

        # If row has 5 spans, pull data from span 2 and 3 ( [1] and [2] )
        if len(span) == 5:
            tJob.title = span[1].a.renderContents()
            tJob.client = 'Some City'
            tJob.source = minURL
            tJob.due = span[2].div.renderContents().replace('&lt;br /&gt;', '')
            self.jobs.append(tJob)

# Don't forget to bind the function to the object!
minion.getJobs = getJobs

# Instantiate the object
mCity1 = minion(minTitle, minURL)
</code></pre>
<p>I also have a separate module that simply contains a list of all the instantiated minion objects (which I have to update each time I add one):</p>
<p><strong>minions.py</strong></p>
<pre><code>from minion_City1 import mCity1
from minion_City2 import mCity2
from minion_City3 import mCity3
from minion_City4 import mCity4

minionList = [mCity1, mCity2, mCity3, mCity4]
</code></pre>
<p>main.py references minionList for all of its activities for manipulating the aggregated data.</p>
<p>This seems a bit chaotic to me, and I was hoping someone might be able to outline a more Pythonic approach.</p>
<p>Thank you, and sorry for the long post!</p>
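One detail of the layout above is worth illustrating: `minion.getJobs = getJobs` assigns to the class, not to a single instance, so every minion ends up using whichever module's function was imported last. A minimal sketch of one possible alternative is shown here, assuming Python 3.6+ and one subclass per site. The names `Minion`, `Job`, `City1Minion`, `City2Minion`, `registry`, and the second URL are hypothetical, and the parsing is stubbed out so the example runs without network access.

```python
class Job:
    """Hypothetical stand-in for the job class from job.py."""

    def __init__(self, title="", client="", source="", due=""):
        self.title = title
        self.client = client
        self.source = source
        self.due = due


class Minion:
    """Base class: each data source subclasses this and overrides get_jobs()."""

    registry = []  # every subclass is recorded here when it is defined
    title = ""
    url = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        Minion.registry.append(cls)

    def __init__(self, user="", password=""):
        self.user = user
        self.password = password
        self.jobs = []
        # Same rule as reqAuth in the question: auth is needed if either is set.
        self.requires_auth = bool(user or password)

    def get_jobs(self):
        raise NotImplementedError("each subclass defines its own get_jobs()")


class City1Minion(Minion):
    title = "Some city"
    url = "http://www.somewebpage.gov/"

    def get_jobs(self):
        # A real version would fetch and parse self.url here (stubbed).
        self.jobs.append(Job(title="Example job", client=self.title, source=self.url))
        return self.jobs


class City2Minion(Minion):
    title = "Another city"
    url = "http://www.example.gov/"  # placeholder URL, not from the question

    def get_jobs(self):
        self.jobs.append(Job(title="Other job", client=self.title, source=self.url))
        return self.jobs


# Built from the registry, so there is no hand-maintained list to update.
minion_list = [cls() for cls in Minion.registry]
for m in minion_list:
    m.get_jobs()
```

With this shape each subclass keeps its own `get_jobs`, and the list of minions is derived from the subclasses that have been defined. In a multi-module project the subclass modules would still need to be imported somewhere so that they register.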
 
