I needed to track changes on a site that offered neither RSS nor any other way to notify you about updates (a classic government site). So what I built, and am about to show here, is a basic way to parse a specific div from a web page and, if it changes, send an email to let me know.
So here is the script:
So, a bit of explanation. This is just one part of a Ruby cron job from a Rails application. The first line loads a settings model that follows the EAV (entity-attribute-value) pattern. It is a basic key/value record with a jsonb value that keeps the previous and the current state.
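To make the shape of that stored value concrete, here is a minimal sketch in plain Ruby. `SettingRow` and the key `watched_div` are hypothetical stand-ins for the real settings model, and the JSON round trip only imitates what a jsonb column would store.

```ruby
require 'json'

# Hypothetical stand-in for one row of the key/value settings table.
# In the Rails app this would be a model backed by a jsonb column.
SettingRow = Struct.new(:key, :value)

# The value keeps both snapshots side by side.
row = SettingRow.new('watched_div', { 'previous' => nil, 'current' => nil })

# Recording a new snapshot shifts "current" into "previous".
row.value = {
  'previous' => row.value['current'],
  'current'  => '<p>First snapshot</p>'
}

# A jsonb column stores the hash as JSON; round-trip it to show the shape.
json = JSON.generate(row.value)
restored = JSON.parse(json)
```

Keeping both snapshots in one value means a single read gives you everything needed for the comparison and for the email.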
Then I used Faraday to make the GET request. There are many other ways to do it, but Faraday is a nice gem and I think it is the best fit for this kind of work. Just be careful with versions: you may find documentation for 2.x, but at the time of writing you may get version 1.8.x of the gem by default. After that I used Nokogiri with UTF-8 encoding to make sure the page is parsed correctly.
The next line uses XPath to find the exact position in the HTML of the part you are interested in. You may need to do some research to get the XPath right. Browsers nowadays give you the full XPath easily (each browser has its own way), but in some cases you may need to walk the tree node by node to work out the full path.
Once it has the content of the div, the rest is straightforward. It checks whether the content is the same as before and, if it isn't, it stores the previous state and the current state and sends an email. You can put this method in a cron job or in a scheduler that triggers it as often as you want to check the website.
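The compare, store and notify step can be sketched in plain Ruby like this. The method `check_for_change`, the `state` hash and the `notifier` lambda are assumptions for the example. In the Rails app the state would live in the settings record and the notifier would call a mailer.

```ruby
# Compares the freshly fetched content with the stored state.
# On a change it shifts current -> previous, stores the new content,
# calls the notifier and returns true. Otherwise it returns false.
def check_for_change(state, fetched, notifier:)
  return false if state['current'] == fetched

  state['previous'] = state['current']
  state['current'] = fetched
  notifier.call(state['previous'], state['current'])
  true
end

# Stand-in for sending an email: just record what would have been sent.
sent = []
notifier = ->(before, after) { sent << [before, after] }

state = { 'previous' => nil, 'current' => '<p>Old</p>' }

unchanged = check_for_change(state, '<p>Old</p>', notifier: notifier)
changed = check_for_change(state, '<p>New</p>', notifier: notifier)
```

Note that with this sketch the very first run, when nothing is stored yet, also counts as a change and triggers a notification.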
I hope you find something useful in this post. I may create a Rails application and implement this parser there to show it in a more complete project. I also plan to add features such as a diff between the previous and current state, a screenshot of the page, and a way to program specific steps (for example a login) to follow before parsing.
Drop me an email if you read it and found it useful, or if you have any questions.