📰 Good News app. Backend in Golang. Colly usage.
Let's crawl some news websites and see how fast and easy it can be with Colly!
I am glad to see you on the third article out of six in this chapter.
If you are not familiar with what I am going to implement in this series of chapters, I would recommend reading the introductory article first:
All chapters of the "book":
- 📰 Good News app. Backend in Golang behind Traefik reverse proxy with https available.
- [in progress] 📰 Good News app. Flutter for rapid mobile applications development.
- [in progress] 📰 Good News app. Hummingbird as a promising replacement for frontend frameworks.
And here are the articles of the current chapter:
- Prerequisites & idea, project and database structure, and API endpoints.
- Project creation, Go modules & Gin (a beautiful framework) integration.
- Colly usage.
- MongoDB setup using official Golang driver.
- Running all together locally with Docker and Docker Compose & Traefik v2.0 configuration.
- Publishing to Digital Ocean, Let's Encrypt and DNS Challenge configuration.
So let's start!
In this article, we will explore a powerful website-crawling library named Colly. Colly is a lightweight yet fast and elegant scraping framework written in Go. The main goal of our scraper is to fetch news posts from several pre-chosen websites, as described in the first article.
This article covers some basic usage of the Colly library and a technique for keeping the crawling process smooth. We will pick one news website out of the three and gather all the desired information from it. The code of the other sites' crawlers will be presented as well, so you can practice a little on your own.
Besides that, we will create models for news, news sources and news types.
Inspect HTML code and CSS selectors of a news website
I would like to apologise in advance that all the news websites are going to be in Russian, but our main goal here is to understand the technique of crawling.
We would like to extract the title, preamble, link, source, type and the time it was added for each article on each website. However, as you can see, the website we will work with has no preamble for its articles, so we will extract only the title and link, while the source, type and time will be set by ourselves. Obviously, the news source for this site is going to be SecretMag, the type is news, and the time, for our convenience, will be set as we crawl the site (we do it every 3 minutes).
With the help of the Google Chrome Inspector, we can determine the CSS selectors for those elements. Article elements are wrapped within <div class="wrapper …">, then <div class="container …">, and each element is wrapped with <div class="item …">. So now we have a selector hierarchy with the structure .wrapper > .container > .item, and for each item we need to extract the link and title (no preamble in this example). We can take the link from inside <a class="link" href="…"> and the title from inside <div class="headline">.
Name: .wrapper > .container > .item > div.headline
Link: .wrapper > .container > .item > a.link[href]
News source: SecretMag
News type: news
Time: <will_be_added_while_crawling>
Colly implementation
Before we start working on our crawler, let's divide our entities into logical models. As discussed in the first article of this chapter, we will need three collections in our database: news, news sources and news types. So we have to create three models.
Create a models/ folder and three files inside it: News.go, NewsSource.go and NewsType.go.
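For reference, here is a minimal sketch of what the three models could look like. The field names, the Added timestamp and the names of the predefined variables are assumptions based on the data we decided to extract (title, preamble, link, source, type and time added), so check the repository for the real code.

```go
// models/News.go -- a single article stored in the "news" collection.
// Field names are assumptions for this sketch.
package models

import "time"

type News struct {
	ID       string    `json:"_id" bson:"_id"`
	Title    string    `json:"title" bson:"title"`
	Preamble string    `json:"preamble" bson:"preamble"`
	Link     string    `json:"link" bson:"link"`
	Source   string    `json:"source" bson:"source"`
	Type     string    `json:"type" bson:"type"`
	Added    time.Time `json:"added" bson:"added"`
}
```

```go
// models/NewsSource.go -- the websites we crawl, plus a predefined value
// used by the crawlers (variable name is an assumption).
package models

type NewsSource struct {
	ID   string `json:"_id" bson:"_id"`
	Name string `json:"name" bson:"name"`
}

var SourceSecretMag = NewsSource{ID: "secretmag", Name: "SecretMag"}
```

```go
// models/NewsType.go -- the categories of articles, plus predefined values.
package models

type NewsType struct {
	ID   string `json:"_id" bson:"_id"`
	Name string `json:"name" bson:"name"`
}

var (
	TypeNews     = NewsType{ID: "news", Name: "news"}
	TypeBusiness = NewsType{ID: "business", Name: "business"}
	TypeStyle    = NewsType{ID: "style", Name: "style"}
)
```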
I would like to say a few words about the `json:"_id" bson:"_id"` tags. The json tag describes how the ID will look when we return output in JSON format, and the bson tag describes how it will be stored in MongoDB.
You might have noticed that the NewsSource.go and NewsType.go files contain some extra information: predefined variables that will be used for convenience while we crawl websites and add new articles to MongoDB.
In Colly, you first need to create a Collector. A Collector gives you access to methods that let you trigger callback functions when a certain event happens. Colly is a very powerful tool and offers a lot of functionality out of the box, such as parallel scraping, a proxy switcher, etc. However, we are going to use only the basics and exploit a few callback functions.
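For example, a bare-bones Collector with a couple of callbacks could look like this (the selector and URL here are just placeholders):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// OnHTML fires for every element matching the given CSS selector.
	c.OnHTML("div.headline", func(e *colly.HTMLElement) {
		fmt.Println("found headline:", e.Text)
	})

	// OnRequest fires before every request is sent.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("visiting", r.URL)
	})

	c.Visit("https://example.com")
}
```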
First of all, we have to download the Colly library: type go get -u github.com/gocolly/colly/... in the terminal. Now we are good to go and can start coding our first crawler. Let's create a crawler/ folder and a secret_mag.go file inside it.
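The complete file lives in the repository; the sketch below only shows its overall structure. The module path, the exact URLs, the NewsFunc signature and the utils.GenerateID helper (we create it a bit later) are assumptions of this sketch.

```go
// crawler/secret_mag.go
package crawler

import (
	"time"

	"github.com/gocolly/colly"

	// Module path and helper names are assumptions for this sketch.
	"github.com/example/good-news-backend/models"
	"github.com/example/good-news-backend/utils"
)

const (
	secretMagBaseURL = "https://secretmag.ru"
	secretMagNewsURL = "https://secretmag.ru/news"
)

// SecretMag keeps all crawling logic for this particular website.
type SecretMag struct{}

// Run is the entry point of the crawler: it executes every crawling
// function of this site and collects the results into one slice.
func (s SecretMag) Run() []models.News {
	var totalNews []models.News

	// NewsFunc is declared in crawler.go; for this site we only register runNews.
	functions := []NewsFunc{s.runNews}
	for _, f := range functions {
		totalNews = append(totalNews, f()...)
	}
	return totalNews
}

// runNews scrapes the "news" section of the site.
func (s SecretMag) runNews() []models.News {
	var news []models.News
	c := colly.NewCollector()

	c.OnHTML(".wrapper", func(e *colly.HTMLElement) {
		e.ForEach(".container", func(_ int, container *colly.HTMLElement) {
			container.ForEach(".item", func(_ int, item *colly.HTMLElement) {
				title := item.ChildText("div.headline")
				link := secretMagBaseURL + item.ChildAttr("a.link", "href")

				news = append(news, models.News{
					// GenerateID is the hypothetical helper from utils/hash.go.
					ID:     utils.GenerateID(title + link),
					Title:  title,
					Link:   link,
					Source: models.SourceSecretMag.Name,
					Type:   models.TypeNews.Name,
					Added:  time.Now(),
				})
			})
		})
	})

	c.Visit(secretMagNewsURL)
	c.Wait()
	return news
}
```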
We create the SecretMag struct in order to keep the logic of each crawler separate. Yes, we could put all the crawling code in one file, but what we have now looks much cleaner. Then we declare two constants with the URLs of the site we want to crawl. After that, we have the Run() function that will be the entry point of our SecretMag crawler. As you may see, there is a NewsFunc type that will be declared later in another file. It is used to shorten a function signature that is used multiple times in our code for the other crawlers as well. The next step is to loop over all registered functions, in our case just one, runNews(), and add the result to the total news gathered from the website. All the crawling magic happens inside the runNews() function. I have chosen this naming because in other crawlers I use types such as business and style, and therefore their function names are runBusiness() and runStyle(). You might see it here and here.
In the runNews() function we create a Colly Collector and then search for any appearance of an element with the class .wrapper, as described earlier in the article. Then we use the ForEach function of colly.HTMLElement to loop over all elements inside the found element. We do the same with the .container element, look for each .item and extract the desired information. Then we take the title and link and generate an _id to make the new article instance unique in our database when it is added to MongoDB. After that we append the newly created news instance to the array that contains all news fetched from one crawling pass. At the end of the runNews() function, we visit the website and wait until the process is finished.
Create a hash.go file under the utils/ folder and paste the code below.
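The helper could be as simple as the MD5-based sketch below; the function name GenerateID is an assumption, so adjust it to match the repository.

```go
// utils/hash.go
package utils

import (
	"crypto/md5"
	"encoding/hex"
)

// GenerateID returns a deterministic hex hash of the given text, used as the
// _id of a news document so the same article is not inserted twice.
func GenerateID(text string) string {
	hash := md5.Sum([]byte(text))
	return hex.EncodeToString(hash[:])
}
```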
That's it with crawling. All the other crawlers are going to have the same structure. The only differences will be the CSS selectors and whether we want to crawl other types of news within the same website, as I do with this crawler. You can check it if you wish.
Since the crawlers are going to be almost the same, we can create an interface, in other words a set of rules implemented by each crawler, and then have one separate function as an entry point to start all crawlers in a background process at the same time. That is why we create a crawler.go file under the crawler/ folder and paste the code below.
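As before, the sketch below only approximates the real file; the module path and the exact loop shape are assumptions.

```go
// crawler/crawler.go
package crawler

import (
	"time"

	// Module path is an assumption for this sketch.
	"github.com/example/good-news-backend/models"
)

// NewsFunc is a single crawling function for one news type on one website.
type NewsFunc func() []models.News

// NewsCrawler must be implemented by every site-specific crawler.
type NewsCrawler interface {
	Run() []models.News
}

// Start launches the crawling loop in a background goroutine.
func Start() {
	go startCrawler()
}

func startCrawler() {
	// Thanks to the interface, adding a new crawler is just one more line here.
	crawlers := []NewsCrawler{
		SecretMag{},
	}

	duration := time.Minute * 3
	for {
		var totalNews []models.News
		for _, c := range crawlers {
			totalNews = append(totalNews, c.Run()...)
		}
		// In the next article totalNews will be written to MongoDB.
		_ = totalNews

		time.Sleep(duration)
	}
}
```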
Here we create the NewsCrawler interface with only one function, Run(), whose implementation we have seen in the secret_mag.go file. We also have the NewsFunc type that was mentioned before. Then we use a goroutine, go startCrawler(), to start the scraping process of all crawlers in background mode. We will call the Start() function only once, before we run our server. startCrawler() contains the simple logic of creating all crawlers (here you can see the power of interfaces) and running them. Here we also specify that a crawling pass will run every 3 minutes. Then we loop over all crawlers and invoke their Run() function, which returns the desired news. The next step will be to add all the news to the database, which we will do in the next article.
Now it is time to check how it all works together. In order to do that, add the line crawler.Start() to server/server.go right before the NewRouter is created. Also set the duration of each crawling pass to 5 seconds in crawler/crawler.go: duration := time.Second * 5, and add a print inside the loop so we can see logs in the terminal: fmt.Printf("\n% #v \n", totalNews).
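In server/server.go the change could look roughly like the fragment below; the Init function name and the router setup are assumptions carried over from the previous article, and only the crawler.Start() call is new.

```go
// server/server.go (fragment)
package server

// Module path is an assumption for this sketch.
import "github.com/example/good-news-backend/crawler"

func Init() {
	// Start background crawling before the router is created.
	crawler.Start()

	r := NewRouter()
	r.Run()
}
```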
That's it. Now we can build our project and run it.
> go build
> ./good-new-backend
Wait for 5 seconds and this is what you should see in the terminal.
After you check that everything works as expected, please remove the code for printing logs, change the duration back to 3 minutes and add all missing crawlers from the GitHub repository, or write your own. When you add the missing crawlers, add their initialisation to crawler.go as well.
In order to stop the server, just press Ctrl + C while your terminal is active.
I am glad that you have read this far. Below you will find the link to the next article.
If you have any comments or suggestions, please feel free to write them down or email me at batr@ggc.team.
If you would like to know when I post new articles, follow me on Twitter 🐦