How to Crawl a Blog Easily
Last updated: Oct 5, 2019
There’s a huge amount of information out there, and it’s getting harder and harder to create unique and appealing content. The common excuse is a lack of time, skills, or energy.
But let’s focus on finding well-performing blog posts from competitors, blogs you adore, or just to do some random analysis. This post is about using a web crawler without getting too deep into technical details. Of all the tools out there, I’ve found Portia from the team at Scrapinghub.com great because it’s free, fast, and user-friendly.
Determine blog
Choose a blog for crawling. Ideally, pick one that is well organized, categorized, and consistently structured. For this example, I’ve chosen the Scrapinghub Blog because it’s easy to access.
Sign up to Scrapinghub
Go to scrapinghub.com and sign up if you haven’t already.
Choose crawler
Choose Portia to create your first project.
Portia is quite straightforward, and there’s a lot you can do with it. Play with it for a couple of minutes and you’ll get used to it in no time. Here’s an overview of the basic steps for using Portia (a rough code equivalent follows the list):
- Enter website URL
- Configure link crawling
- Set sample page
- Choose elements
- Publish project
- Run crawler
- Analyze data
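For the curious: Portia is built on top of Scrapy, so those point-and-click steps roughly correspond to a hand-written Scrapy spider. Here’s a minimal sketch of what that might look like; the spider name and every CSS selector are assumptions for illustration, not Portia’s actual output:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    """A rough hand-written equivalent of a Portia project (illustrative only)."""
    name = 'blog-crawler'                          # placeholder spider name
    start_urls = ['https://blog.scrapinghub.com']  # the website URL you entered

    def parse(self, response):
        # "Configure link crawling": follow links to individual posts.
        # The selectors here are guesses at the blog's markup.
        for href in response.css('a.post-link::attr(href)').getall():
            yield response.follow(href, callback=self.parse_post)
        # ...and follow pagination so the whole blog gets crawled
        next_page = response.css('a.next-posts-link::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_post(self, response):
        # "Choose elements": the fields get filled in under "Specify sections"
        yield {'url': response.url}
```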
Specify sections
Now, this is where things can get complicated. Don’t just crawl the complete page, because that will make the data harder to clean and analyze later on. Instead, select only the information you need by clicking on the sections you want. In our example, let’s choose the title, date, author, number of comments, description, and URL for each blog post.
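Written out by hand, that step is the `parse_post` callback from the sketch above, now with all six fields filled in. Every CSS selector below is a placeholder guess at the blog’s real markup:

```python
def parse_post(self, response):
    # one selector per section clicked in Portia; all selectors are
    # hypothetical placeholders for the blog's actual markup
    yield {
        'title': response.css('h1.entry-title::text').get(),
        'date': response.css('time::attr(datetime)').get(),
        'author': response.css('a.author::text').get(),
        'comments': response.css('.comments-count::text').get(),
        'description': response.css('meta[name="description"]::attr(content)').get(),
        'url': response.url,
    }
```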
Collect data
Go back to your Scrapinghub dashboard and click the green RUN button. Great! In less than a minute, we’ve extracted the entire blog: 107 blog posts* with 8 fields each.
*We’ve excluded the full content of each post because we just needed an overview. But we could’ve extracted the entire content if we wanted to.
I can’t imagine the time I would’ve wasted if I’d manually copied and pasted that much information!
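As a side note, if you’d rather start runs from code than from the dashboard, a minimal sketch with the python-scrapinghub client might look like this (the API key, project ID, and spider name are all placeholders):

```python
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')  # placeholder API key
project = client.get_project(123456)        # placeholder project ID
job = project.jobs.run('blog-crawler')      # placeholder spider name
print(job.key)  # the job key, e.g. '123456/1/1'; watch progress in the dashboard
```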
Export data
We’ve finally come to the fun part. Let’s export our data as CSV, import it into Google Sheets, and start analyzing the content.
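The dashboard has a one-click CSV export, but you can also pull the items programmatically. Here’s a minimal sketch, assuming the placeholder job key from the run above:

```python
import csv

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')  # placeholder API key
job = client.get_job('123456/1/1')          # placeholder job key

# collect all scraped items and write them to a CSV file
items = list(job.items.iter())
fields = sorted({key for item in items for key in item})
with open('blog_posts.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(items)
```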
Analyze data
Raw data is useless without questions, so let’s ask some to find out more about our data. Note that I only collected the data I was interested in; you may want to adjust what you collect to suit the analysis you have in mind.
Which blog post had the most comments?
Who has created the most content?
How frequently were posts published in 2016?
On average, a blog post was published every 8 days in 2016. The calculation takes the date difference between each pair of consecutive posts and averages those differences across 2016.
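The same questions can be answered in a few lines of pandas instead of a spreadsheet. The column names below are assumptions based on the fields we chose:

```python
import pandas as pd

df = pd.read_csv('blog_posts.csv', parse_dates=['date'])  # assumed column names

# Which blog post had the most comments?
print(df.loc[df['comments'].idxmax(), ['title', 'comments']])

# Who has created the most content?
print(df['author'].value_counts().head())

# How frequently were posts published in 2016? Average the day gaps
# between consecutive posts, as described above.
posts_2016 = df[df['date'].dt.year == 2016].sort_values('date')
print(posts_2016['date'].diff().dt.days.mean())
```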
Now that we know how to extract data with Portia, we can collect data from almost anywhere* we’d like.
*Many websites block crawlers they don’t trust or recognize, so Portia might not work on some of them.
Is this legal?
In my opinion, collecting publicly available data isn’t an issue: we’re not breaching any databases, collecting personal information, or using the data for commercial purposes. However, this is a different topic that deserves to be covered in a future post.
Happy crawling!