Webscraping projects

I used to do webscraping jobs using PHP.  I did around 5 or 6 jobs, starting out on a couple of basic websites that could be scraped with PHP’s built-in HTTP request functions.  Then I found out that a lot of sites use frontend frameworks to dynamically generate content, and much of that JavaScript is minified and obfuscated.  I needed something more powerful than a basic HTTP scraper for those sites, so I learned browser automation using Laravel Dusk with ChromeDriver.  Dusk is part of the Laravel web application framework and is intended for testing your own web applications, but it can obviously be used for many other purposes, including webscraping.  It can also run in headless mode, which is how I set up a scraper to run on a cron job on a Digital Ocean VPS (my last job, which I failed to deliver on; more on that below).
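For anyone curious what that setup involves: Dusk drives Chrome through ChromeDriver, and headless mode is just a matter of passing the right flags when the driver is created.  Below is a minimal sketch along the lines of the stock DuskTestCase class that Dusk publishes into a Laravel project, not my actual scraper code; the flags and port shown are the usual defaults.

```php
<?php

namespace Tests;

use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Laravel\Dusk\TestCase as BaseTestCase;

abstract class DuskTestCase extends BaseTestCase
{
    /**
     * Create the RemoteWebDriver instance that Dusk will drive.
     * The --headless flag is what lets this run on a VPS with no display.
     */
    protected function driver()
    {
        $options = (new ChromeOptions)->addArguments([
            '--headless',
            '--disable-gpu',
            '--no-sandbox',
            '--window-size=1920,1080',
        ]);

        return RemoteWebDriver::create(
            'http://localhost:9515', // ChromeDriver's default port
            DesiredCapabilities::chrome()->setCapability(
                ChromeOptions::CAPABILITY, $options
            )
        );
    }
}
```

From there, a cron entry on the VPS just kicks the scraper off on whatever schedule the client wants.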

I got into webscraping because one of my strong areas is parsing and processing raw data.  I enjoy the challenge of “solving the puzzle” using logic and algorithms.  I also figured it could be something to do on the side while I learned the latest tools for web development.

The problems I ran into, and the reasons I no longer do webscraping, are as follows:

  1. I wasn’t making enough money.  Most jobs ended up taking five or more times longer than I anticipated because of unexpected issues that came up along the way.  A job that was supposed to take a few hours for $50 would end up taking me a few weeks.  Then I felt bad for taking so long, so a lot of the time I’d finish it for free.  On top of that, I had no time to learn what I was originally trying to learn (web development).
  2. It was difficult to deliver something that was 100% working.  Because my code has to be so tightly coupled to the target site’s markup, any anomaly on the target site’s part can completely derail the scraper.  It could be thousands of pages into a crawl, and if something is just a tiny bit off, the program crashes (there’s a sketch of the kind of defensive parsing this forces on you after this list).  This was especially frustrating, and it’s a big part of why things always took much longer than I originally anticipated.
  3. Clients demanded too much from these scrapers.  Many of them wanted data updated far faster than the scrapers could work.  It’s unreasonable to expect thousands of pages to be crawled every five minutes on a shared hosting package or on your local machine.  It can be done, but it would take hundreds or thousands of dollars per month for a network of servers dedicated to the task.  Instead, clients wanted it done on a $10/month shared hosting plan and then wondered why their site got shut down for overloading the network.  Unreal.
  4. Questionable legality and morality of many of the jobs.  A lot of these jobs fall into a grey area that I’m not always comfortable with.  Webscraping, in essence, is using a website in ways that were never intended in order to harvest data for one reason or another.  For that reason, I tried not to ask too many questions about what clients needed the data for.  There was one job I did for some “HYIP” monitor sites that was like that.  When I got curious and looked into what it was all about, I learned it was some sort of Ponzi scheme using cryptocurrencies.  I finished what the client wanted me to do and told him I didn’t want to work with him anymore.  That was the job with 70 HTTP scrapers running simultaneously on a shared hosting account that got shut down.  I tried to explain the limitations of his hosting package, but he didn’t want to listen.  Besides, with my skill set there’s no reason to take shady jobs like that.  There are plenty of legitimate opportunities out there, and no reason to be working with scummy people.
  5. Modern bot detection from Google.  The last job I did was a Craigslist scraper.  The client was a real estate company that wanted rental ads scraped for general and contact details.  This is the job where I set up full browser automation running in headless mode on a Digital Ocean VPS.  It was a pretty elaborate operation that ultimately failed.  The scraper needed to distinguish owner-managed properties from ones already handled by a property manager, since the client was only interested in owner-managed listings so they could offer to manage them.  The data was then sent to a Google spreadsheet the client used to manage leads.  The problem I ran into was the bot detection that Craigslist uses: Google’s invisible reCAPTCHA, the latest version of CAPTCHA.  The old ones (photos of distorted text) can often be solved with an optical character recognition (OCR) library.  The latest version, however, is not so easy to get around.  It tracks your behavior as you browse a site and runs a risk-analysis algorithm on Google’s servers for certain actions you take.  It’s pretty hard to fool when you’re visiting every ad in every region of a state for a particular category and doing the same thing on every page.  I tried fooling it by having the scraper switch proxy servers every so often (sketched right after this list), but Craigslist keeps track of and blocks many free proxies as well, and I suspect the bot detection is smart enough not to be fooled by swapping proxies anyway.  I had to tell the client that I couldn’t deliver on the project, and that’s when I decided to get out of webscraping entirely.
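For what it’s worth, the proxy switching from point 5 was nothing exotic: Chrome accepts a --proxy-server flag, so the scraper simply tore down the browser every so often and started a fresh session pointed at a different proxy.  The sketch below shows the general idea; the proxy addresses and the every-50-pages interval are made up for illustration, and the real thing lived inside the Dusk scraper rather than a standalone script.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // php-webdriver via Composer

use Facebook\WebDriver\Chrome\ChromeOptions;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;

// Placeholder proxies (documentation addresses, not real servers).
$proxies = [
    '203.0.113.10:3128',
    '198.51.100.22:8080',
    '192.0.2.55:3128',
];

// Start a new headless Chrome session routed through the given proxy.
function freshBrowser(string $proxy): RemoteWebDriver
{
    $options = (new ChromeOptions)->addArguments([
        '--headless',
        '--disable-gpu',
        '--proxy-server=http://' . $proxy,
    ]);

    return RemoteWebDriver::create(
        'http://localhost:9515',
        DesiredCapabilities::chrome()->setCapability(
            ChromeOptions::CAPABILITY, $options
        )
    );
}

$pageUrls = [/* list of ad URLs to crawl */];

$driver = freshBrowser($proxies[0]);
foreach ($pageUrls as $i => $url) {
    // Every 50 pages, restart the browser on the next proxy in the list.
    if ($i > 0 && $i % 50 === 0) {
        $driver->quit();
        $driver = freshBrowser($proxies[intdiv($i, 50) % count($proxies)]);
    }
    $driver->get($url);
    // ... scrape the page ...
}
$driver->quit();
```

It didn’t help much, though; the risk scoring seems to care more about behavior patterns than about which IP the requests come from.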
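As for the fragility I complained about in point 2, the only real mitigation I found was to treat every selector as optional and every page as allowed to fail: check that an element exists before reading it, wrap each page in a try/catch, log the problem, and keep crawling instead of letting one odd page kill a run that is thousands of pages deep.  Something along these lines, using the php-webdriver library that Dusk wraps under the hood; the selectors and field names are placeholders, not any real site’s markup.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // php-webdriver via Composer

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\WebDriverBy;

/**
 * Scrape one listing page, returning null instead of crashing
 * when the page does not look the way we expect.
 */
function scrapeListing(RemoteWebDriver $driver, string $url): ?array
{
    try {
        $driver->get($url);

        // findElements() returns an empty array instead of throwing,
        // so a missing piece becomes a null field rather than a fatal error.
        $titleNodes = $driver->findElements(WebDriverBy::cssSelector('h1.listing-title'));
        $priceNodes = $driver->findElements(WebDriverBy::cssSelector('.price'));

        if (empty($titleNodes)) {
            error_log("Skipping $url: no title found (removed ad? layout change?)");
            return null;
        }

        return [
            'url'   => $url,
            'title' => trim($titleNodes[0]->getText()),
            'price' => empty($priceNodes) ? null : trim($priceNodes[0]->getText()),
        ];
    } catch (\Exception $e) {
        // One bad page should not take down the whole crawl.
        error_log("Error on $url: " . $e->getMessage());
        return null;
    }
}
```

The caller just skips the nulls and keeps going, and the log tells you which pages to look at when the target site changes its markup.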
