How to Build an Automated System That Sends Summarized ‘MarketWatch’ Articles Using R

Mandula Thrimanne
Published in Analytics Vidhya · 6 min read · Oct 10, 2020


Source: MarketWatch

Staying up to date on relevant news can be a daunting task in this information era. Everything from Elon Musk's new initiatives to the Kardashians' new make-up lines can be overwhelming for the average person to absorb. But an excess of information doesn't mean we should give up on the news entirely. Rather, we should try to gather and understand the information we consider valuable, from reliable sources, on a daily basis.

This is why Facebook, Twitter, YouTube, and other social media giants hire data scientists to work out exactly what their users prefer to see in their feeds, enhancing the experience instead of flooding everyone with whatever is generally popular, which would be highly ineffective for user retention. So every time you decide whether or not to watch something, or even whether or not to click on a post, you are unknowingly helping them get to know you better (in a less creepy sense, making their algorithm more accurate). And if you're currently having an identity crisis, I suggest you open your YouTube home page and let it tell you exactly what kind of person you are and what your interests are. Whether you like it or not, you can be fairly certain it's not completely wrong.

In this article, I will share a technique for building your own system that sends you (or others) summarized articles on topics that interest you, using R. Consider this your first step toward taking control of what information you receive throughout the day. For this example, I chose the website 'MarketWatch', a site that provides the latest stock market, financial, and business news. To scrape it, we'll be using a package called 'rvest', which was built to make scraping data from HTML web pages easy.

Let’s get started.

As the first step in our journey, we must visit the website we plan to scrape. I have recently developed an interest in understanding the impact of the Coronavirus on the financial markets. Therefore, as you can see below, I have chosen the articles published on 'MarketWatch' that contain any information about the Coronavirus.

Before we start coding, it's worth getting to know the packages and functions we'll be using. So if you ever find yourself struggling to understand what a certain function is or what it actually does, my advice is to come back to this point and read its description carefully.

rvest

Wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, then manipulate, HTML and XML.

read_html : Reads HTML or XML into R.
html_nodes : Extracts pieces out of HTML documents using XPath and CSS selectors.
html_attr : Extracts attributes from HTML.
html_text : Extracts text from HTML.

lubridate

Lubridate provides tools that make it easier to parse and manipulate dates.

gsub (base R, not lubridate) : Replaces all matches of a pattern in a string (its sibling, sub, replaces only the first match).
parse_date_time : User-friendly date-time parsing function. Parses an input vector into POSIXct date-time objects.
ymd_hms : The ymd_hms() family of functions recognizes all non-alphanumeric separators and correctly handles heterogeneous date-time representations.
difftime (base R) : Creates, prints, and does arithmetic on time intervals.

stringr

A consistent, simple and easy to use set of wrappers around the fantastic ‘stringi’ package.

str_squish : str_squish() removes whitespace at the start and end of a string and collapses repeated whitespace inside it into single spaces.

LSAfun

Offers methods and functions for working with Vector Space Models of semantics, such as Latent Semantic Analysis.

genericSummary : Selects the sentences from a text that best describe its topic.
paste (base R) : Concatenates vectors after converting them to character.
append (base R) : Merges vectors and adds elements to a vector.

sendmailR

Simplistic sendmail utility for R. Uses SMTP to submit a message to a local SMTP server.

sendmail : Sends an email.

To reach our final goal of building a system that automatically sends emails containing summarized articles about the Coronavirus, we need five pieces of information for each article: the URL of the article, the date and time it was published, the time since it was published, its title, and its body. The reason for choosing each of these five variables will be explained along the way.

Getting the first variable (the URLs) is pretty straightforward.
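A minimal sketch of this step is shown below. The search URL and query format are assumptions about MarketWatch's search page; the selector "div.searchresult a" is the one referred to in this article, and you should verify both against the live page.

```r
library(rvest)

# Search results page for Coronavirus-related articles
# (the exact URL/query format is an assumption; adjust it to what you see in your browser)
search_url <- "https://www.marketwatch.com/search?q=coronavirus"
search_page <- read_html(search_url)

# Pull the link ("href") out of every search result node
urls <- search_page %>%
  html_nodes("div.searchresult a") %>%
  html_attr("href")

head(urls)
```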

An important thing to note here is how to find the terms "div.searchresult a" and "href" in the first place. The pictures below show the exact steps, using your browser's developer tools to inspect the search results page.

Steps 1–3 (screenshots): locating the "div.searchresult a" selector and the "href" attribute in the browser's developer tools.

The reason for finding the date and time an article was uploaded is to derive the second variable: how long ago the article was published. To do that, you must convert the raw scraped text into standard date-times and then convert those into your local time zone. For these specific tasks, we introduce the package 'lubridate'. Once the data is in the form we need, finding the time difference is just a matter of subtracting the published (local) time from your system's time.
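Here is a rough sketch of that conversion. The raw_time string and the cleaning pattern are illustrative assumptions; the actual timestamp format scraped from MarketWatch may differ, so adjust the gsub pattern and parsing order accordingly.

```r
library(lubridate)

# Illustrative raw timestamp as it might appear on the page
raw_time <- "Oct. 10, 2020 at 9:05 a.m. ET"

# Strip the pieces the parser does not need ("at", "ET", and the periods)
clean_time <- gsub("at |ET|\\.", "", raw_time)   # "Oct 10, 2020 9:05 am"

# Parse into a POSIXct date-time (MarketWatch times are assumed to be US Eastern)
published_et <- parse_date_time(clean_time, orders = "b d, Y I:M p",
                                tz = "America/New_York")

# Convert to your local time zone, then measure the time since publication
published_local <- with_tz(published_et, tzone = Sys.timezone())
hours_since <- as.numeric(difftime(Sys.time(), published_local, units = "hours"))
```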

Since we are only interested in the latest articles, it makes sense to set a limit on how old an article can be and still count as "latest". I decided the limit should be 15 hours. Therefore, a summarized version of any article published within the last 15 hours will be included in the mail.
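Assuming hours_since has been computed for every scraped article (one value per URL), the filter itself is a one-liner:

```r
# Keep only the articles published within the last 15 hours
latest_urls <- urls[hours_since <= 15]
```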

Now that we have obtained sufficient data for three of the five variables, we are left with extracting the title and body of each article. This task is not as easy as it sounds, because we often end up with paragraphs and lines containing "\n" characters and long runs of spaces between words, as shown below.

Therefore it’s critical that we also have some knowledge on cleaning our texts using the ‘stingr’ package which gets rid of the unnecessary terms and spaces in text outputs.

Now that we have the bodies of the articles cleaned up and ready to go, we are left with summarizing the bodies of the selected articles as our final step before sending out the email. For this task, we will be using the package 'LSAfun'.
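With a cleaned body in hand, a minimal summarization step could look like this (the number of sentences to keep, k = 3, is an arbitrary choice):

```r
library(LSAfun)

# Select the k sentences that best describe the article's topic
summary_sentences <- genericSummary(body, k = 3)

# Stitch the selected sentences back into a single summary paragraph
summary_text <- paste(summary_sentences, collapse = " ")
```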

After we get the summarized versions of the articles that qualified as "latest", we can combine the titles and summarized bodies using 'append' and build a data frame with two columns.
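For instance, accumulating the results across articles and building the two-column data frame might look like this (the object names are illustrative; the append calls would sit inside a loop over the latest articles):

```r
all_titles    <- c()
all_summaries <- c()

# Inside the loop over the latest articles, append each title and its summary
all_titles    <- append(all_titles, title)
all_summaries <- append(all_summaries, summary_text)

# One row per summarized article: title in one column, summary in the other
digest <- data.frame(title = all_titles, summary = all_summaries,
                     stringsAsFactors = FALSE)
```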

After coupling up titles and summaries, we can simply use the ‘sendmailR’ package and its functions to send the email.
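A minimal sketch of that last step is shown below, assuming you have an SMTP server you can reach; the addresses and server name are placeholders to be replaced with your own.

```r
library(sendmailR)

from    <- "<me@example.com>"
to      <- "<you@example.com>"
subject <- "MarketWatch Coronavirus digest"

# Turn the data frame into a plain-text body: title, then summary, then a blank line
msg_body <- paste(paste0(digest$title, "\n", digest$summary), collapse = "\n\n")

# Submit the message to an SMTP server (replace with your own relay)
sendmail(from = from, to = to, subject = subject, msg = msg_body,
         control = list(smtpServer = "smtp.example.com"))
```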

Now that you know how to build an automated system that sends summarized articles using R, you can use it to get the latest information on any topic you're interested in. Even though this system is far from the quality of summarization a human would achieve by reading articles and summarizing them manually, it is certainly a good place to start. So when it comes to building automated systems, the question we should all be asking ourselves is, "Is there anything in my life that I do repetitively which could be replaced by an automated system?", and start from there.
