When I built a new Python web app, I used to reach for Flask or Django. However, they can be complicated and require specialized web-development knowledge.
Streamlit, a newer Python library for building data apps, has dramatically reduced the time web development takes. It is a great tool for those who want to focus on analytics, modeling, and other module development rather than web design work such as HTML, CSS, and JavaScript.
If you want to explore further, you can learn more about Streamlit here.
Motivation
There are many insightful news sites around the world, but checking them all every day takes too much time. In particular, it is hard for me to keep up with all of the recent tech news. I therefore wanted a news aggregator that collects and summarizes the most recent and popular articles from my favorite tech news sites, which is more efficient and accurate than wandering the ocean of random tweets. In this article, I will walk through my project: a news-aggregator Python app.
Full source code is also available here!
Table of Contents
Web Scraping
Summarizing the articles
Deploying with Streamlit
Conclusion
Web Scraping
First, we start with web scraping to extract data from websites.
Here, I use requests and BeautifulSoup, two great Python libraries for web scraping.
aggregator.py
import requests
from bs4 import BeautifulSoup

# Fetch the TechCrunch front page and parse the HTML
r = requests.get('https://techcrunch.com/')
soup = BeautifulSoup(r.content, 'html.parser')
Let's see which parts of the page contain the content we want and extract them using BeautifulSoup.find_all().
notebook.ipynb
elements = soup.find_all('a', attrs={"class": "post-block__title__link"})
for i in range(5):
    print(elements[i].text.replace('\n', '').replace('\t', ''))
    print(elements[i].get('href'))
Summarizing the articles
Now we've got the titles of the articles, but only listing the titles doesn't sound cool.
Next, let's build a summarizer for each article to make our app more useful. In this project, I made a simple NLP model using spaCy.
Before text is fed to an NLP model, it needs to be preprocessed. I show a summary of NLP data preprocessing in the picture below. I skip the details because they are out of the scope of this article, but if you want to learn more about NLP data preprocessing, this is a good article to start with.
In this project, I made a simple summarizer in the steps below.
Tokenize the text with the SpaCy pipeline.
Count the number of times a word is used.
Calculate the sum of the normalized count for each sentence.
Extract the highest-ranked sentences.
summarizer.py
summarize(txt, n_sentence = 1)
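The four steps above can be sketched in a few lines of Python. This is a minimal, self-contained sketch of the frequency-based approach: to keep it runnable without downloading a spaCy language model, it uses a simple regex tokenizer and a tiny hard-coded stop-word list in place of the spaCy pipeline (the actual project uses spaCy for tokenization, and spaCy ships a full stop-word list).

```python
import re
from collections import Counter

# Tiny stop-word list for illustration only; spaCy provides a complete one.
STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'to', 'of', 'and', 'in', 'it', 'this'}

def summarize(txt, n_sentence=1):
    """Return the n_sentence highest-scoring sentences of txt."""
    # 1. Tokenize into sentences (the real project uses the spaCy pipeline here).
    sentences = re.split(r'(?<=[.!?])\s+', txt.strip())
    # 2. Count how many times each word is used, ignoring stop words.
    words = [w for w in re.findall(r'[a-z]+', txt.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    max_freq = max(freq.values())
    # 3. Score each sentence by the sum of its normalized word counts.
    scores = {}
    for sent in sentences:
        for w in re.findall(r'[a-z]+', sent.lower()):
            if w in freq:
                scores[sent] = scores.get(sent, 0) + freq[w] / max_freq
    # 4. Extract the highest-ranked sentences, kept in their original order.
    top = sorted(sentences, key=lambda s: scores.get(s, 0), reverse=True)[:n_sentence]
    return ' '.join(s for s in sentences if s in top)
```

With n_sentence=1 this returns the single sentence whose words are, on balance, the most frequent in the whole article, which is exactly the extractive idea described above.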
After running an article through our summarizer, its summary looks like this.
Deploying with Streamlit
Now, let's deploy our app using Streamlit. Although you have to build a custom component yourself if you want advanced behavior, Streamlit covers the basic functionality, such as markdown, charts, and buttons, out of the box. If those basic elements are all you need, deploying your app is super simple.
If you want to learn more about Streamlit, this video is the best to start with!
main.py
import streamlit as st
from aggregator import Aggregator

st.title('Tech News Aggregator')
agg = Aggregator(<SOURCE>)
# 'c' was not defined in the original snippet; a Streamlit container
# holding the article list is one reasonable choice.
c = st.container()
for j in range(len(agg.titles)):
    c.write('**' + agg.titles[j] + '** ' + '([link](%s))' % agg.urls[j])
    txt = agg.summarize_text(agg.urls[j], n_sentence=1)
    c.write(txt)
Launch it locally with streamlit run main.py, and our tech news aggregator looks like this.
Conclusion
In summary, we created a news aggregator app using Streamlit.
The most recent articles from popular tech news websites were extracted using web scraping, and the content of each article was summarized using NLP.
As follow-up applications of this project, other aggregators, such as a sports news aggregator, a COVID data reporter, or a movie summarizer, would be interesting topics.