Web scraping, BeautifulSoup, Selenium – Various explorations in web scraping with Python and a timid jump into surveillance capitalism
I was in a hurry to publish the last post of 2020. In the end, I postponed the publication, as most of the issues addressed in this post are so 2021!
To make it short, last year a friend asked me to apply web scraping in a campaign to spy on competitors. I should say Business Intelligence, it is more corporate and politically correct! Anyway, it does not surprise me much that learning testing, automation and Python led me to this area, which is partly included in the surveillance capitalism phenomenon. After all, this is still information. Precisely, in my case, the guy wanted solely a powerful automation that would collect, process and order a lot of public information available on web pages. [SIC]
I won’t tell you how, what and why, but let me just say that with the help of Python, NLP and web scraping… I can assure you that we pinned down the crook.
In conclusion, the transparency dictatorship that prevails in surveillance capitalism has its best days ahead of it. Indeed, digital has become a personal data mother lode!
On a more personal POV 🙂 Here is what I learnt from this experience:
- It has become tremendously easy to “hijack” or divert technologies from their original uses to build a Surveillance Swiss Army Knife, even for an absolute beginner like me.
- Learning is rewarding but practising is even better. With this use case, I was suddenly no longer facing theory but a complex reality. I took it as an opportunity to test my problem-solving ability. It was a real exercise in practising what I had just learned (mostly Python).
I hope this quick and mundane introduction did not have a chilling effect on you! Here are some commands and scripts about web scraping, all in Python. I have purposely excluded the parts made with NLP or with JavaScript end-to-end testing frameworks such as CodeceptJS or Cypress, which are beyond the scope of this post.
You can get all the code and files on my GitHub account in webscraping_with_python.
By the way, a great source of inspiration: all the works of Al Sweigart at https://inventwithpython.com/. “Automate the Boring Stuff with Python” has been my recent bedside reading!
Some elements about Web Scraping
Here are various examples that automate browser usage in order to get a map on Google Maps, perform a search on Google, or download and save a file. I also made some attempts to grab HTML content with the help of BeautifulSoup: all the files named with the pattern slurpThatSoup_1.py, slurpThatSoup_2.py… handle a BeautifulSoup object built from HTML.
FYI, BeautifulSoup (bs4) is one of the best web scraping libraries in Python. You can find more on the official website and in the documentation.
- Beautiful Soup: We called him Tortoise because he taught us. https://www.crummy.com/software/BeautifulSoup/
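To give an idea of the "get a map" kind of automation, here is a minimal sketch that only uses the standard library. It is not one of the scripts from the repository: the function names `maps_url` and `open_map` and the sample address are made up for illustration, and the URL shape assumes Google's documented Maps search URL format.

```python
import webbrowser
from urllib.parse import urlencode

# Assumption: the Google Maps "search" URL format (api=1 plus a query parameter).
MAPS_SEARCH = "https://www.google.com/maps/search/"


def maps_url(address):
    """Build a Google Maps search URL for a free-text address."""
    return MAPS_SEARCH + "?" + urlencode({"api": 1, "query": address})


def open_map(address):
    """Open the map for the address in the default browser."""
    return webbrowser.open(maps_url(address))


# Only print the URL here; call open_map(...) to actually launch the browser.
print(maps_url("10 Downing Street, London"))
```

The address could just as well come from the command line or from the clipboard (for instance with pyperclip), which is why the URL building is kept separate from the browser call.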
- Beautiful Soup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
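As a quick taste of what handling a BeautifulSoup object looks like, here is a small sketch that parses an inline HTML string instead of a downloaded page. The HTML snippet and the helper name `extract_links` are hypothetical, chosen only for the example; in a real script the HTML would typically come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for the content of a downloaded page.
HTML = """
<html><head><title>Demo page</title></head>
<body>
  <h1 id="main">Hello soup</h1>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
</body></html>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


soup = BeautifulSoup(HTML, "html.parser")
print(soup.title.string)                     # text of the <title> tag
print(soup.select_one("#main").get_text())   # CSS selector lookup by id
print(extract_links(HTML))
```

The built-in "html.parser" is used here so that nothing beyond beautifulsoup4 needs to be installed.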
As a huge supporter of testing, I made extensive use of the selenium module to control the browser, leveraging Selenium’s WebDriver methods for finding HTML elements.
Even so, it is sometimes easier to divert testing frameworks such as CodeceptJS to perform actions in the browser instead of using Selenium.
Due to disk space limitations, I am using Anaconda to build my personal environment on an external disk. See my previous post about conda install: Python, Anaconda, WordPress, JSON, JSON-SCHEMA – Messy post with few practices and feedback from my P.O experience.
Commands to install the required Python libraries
    # install requests and beautifulsoup4
    pip install requests
    pip install beautifulsoup4
    # install pyperclip
    pip install pyperclip
Check if the installed libraries are working
    # check that the libraries are working
    # go to a Python console
    >>> import pyperclip
    >>> import requests
    >>> import bs4  # the beautifulsoup4 package is imported as bs4
    # if you get no errors, you are fine...
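The same check can be scripted rather than typed in a console. The sketch below is only one possible way to do it, with a hypothetical helper name `check_modules`; it relies on `importlib.util.find_spec`, which looks a module up without importing it. Note that the beautifulsoup4 package is imported under the name bs4.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if it can be found, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names of the libraries used in this post.
for name, found in check_modules(["requests", "bs4", "pyperclip"]).items():
    print(f"{name}: {'OK' if found else 'missing'}")
```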
Installing the selenium Module
    # required installation
    pip3 install selenium
    # geckodriver is better installed using Homebrew
    brew install geckodriver
    # check version
    geckodriver --version
    # check where the executable is
    which geckodriver
    # Source: https://www.dev2qa.com/how-to-resolve-webdriverexception-geckodriver-executable-needs-to-be-in-path/
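Once selenium and geckodriver are installed, controlling the browser and finding HTML elements could look like the sketch below. It is an illustration under assumptions, not the code from the repository: the function names `geckodriver_path` and `collect_links` are invented, Firefox is assumed to be installed, and the selenium import is placed inside the function so that the PATH check still runs on a machine where selenium is not yet installed.

```python
import shutil


def geckodriver_path():
    """Return the full path of geckodriver if it is on PATH, else None."""
    return shutil.which("geckodriver")


def collect_links(url):
    """Open url in Firefox and return the page title and the href of every link."""
    # Imported here so the PATH check above works even without selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        hrefs = [a.get_attribute("href") for a in anchors]
        return driver.title, hrefs
    finally:
        driver.quit()  # always close the browser, even on error


path = geckodriver_path()
print("geckodriver:", path if path else "not found on PATH")
```

A call such as `collect_links("https://example.com")` launches a real Firefox window, so it is left out of the script body on purpose.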
Conclusion: web scraping is just an appetizer, and stopping there would mean not seeing the wood for the trees. It gave me the occasion to dig deeper into a subject that matters to me: AI models generating misinformation or giving illusions of meaning. This time, I was chasing a crook, but next time I may be the liar or the crook with the help of these technologies, who knows? Ethics are melting like the polar ice! See the Read more section below for more advanced posts on the subject.
Read more
- Google collects a frightening amount of data about you. You can find and delete it now
https://www.cnet.com/how-to/google-collects-a-frightening-amount-of-data-about-you-you-can-find-and-delete-it-now/
- Surveillance capitalism on Wikipedia
https://en.wikipedia.org/wiki/Surveillance_capitalism
- We read the paper that forced Timnit Gebru out of Google. Here’s what it says.
https://www.technologyreview.com/2020/12/04/1013294/google-ai-ethics-research-paper-forced-out-timnit-gebru/
- A college kid’s fake, AI-generated blog fooled tens of thousands. This is how he made it.
https://www.technologyreview.com/2020/08/14/1006780/ai-gpt-3-fake-blog-reached-top-of-hacker-news/
- Facebook translates ‘good morning’ into ‘attack them’, leading to arrest
https://www.theguardian.com/technology/2017/oct/24/facebook-palestine-israel-translates-good-morning-attack-them-arrest
- This could lead to the next big breakthrough in common sense AI
https://www.technologyreview.com/2020/11/06/1011726/ai-natural-language-processing-computer-vision/
- 100 must-read classic books, as chosen by our readers
https://www.penguin.co.uk/articles/2018/100-must-read-classic-books/