
How I Used a Web Crawler to Obtain Salary Data for a STEM Project Aimed at Empowering Young Women

Writer: Erica Mangino Giuliani

In today's digital age, technology and data play a key role in shaping our future, especially in supporting underrepresented groups in fields like STEM (Science, Technology, Engineering, and Mathematics). To encourage young girls to explore these industries, I embarked on a project in my master's program to create a web application that provides guidance counselors and teachers with resources to spark interest in STEM subjects. A pivotal aspect of this project was developing a web crawler to extract salary data from the Bureau of Labor Statistics (BLS).


In this post, I will share my journey through this project, the challenges I faced, and the valuable lessons I learned along the way.


Understanding the Project's Purpose


Before getting into the technical details, it's important to consider the aim of my project. We wanted to inspire young girls to pursue careers in STEM, fields that remain male-dominated. By providing information about salaries, we aimed to highlight the financial rewards of these careers.


For example, STEM jobs tend to offer higher salaries compared to non-STEM positions. According to the BLS, the median annual wage for STEM occupations in May 2022 was about $98,000, compared to $45,760 for non-STEM jobs. This kind of information helps underscore the potential benefits of these careers.


The Role of the Web Crawler


Web crawlers are scripts that automatically browse the internet to collect data from different sources. I built this crawler in R, a language well suited to data analysis. The goal was to gather salary information from the BLS, which hosts a wealth of labor statistics that were crucial for our project.


Developing the web crawler was a wonderful learning experience. It required coding, critical thinking, and problem-solving skills, which helped me appreciate the complexities of data manipulation.


Challenges Faced


Creating the web crawler was not without its challenges.


One major hurdle was making sure the crawler could effectively navigate the BLS website and extract only the necessary data. With a vast amount of information available, it was easy to feel overwhelmed. Focusing on specific salary metrics relevant to our project required significant precision.


Moreover, the website's structure occasionally changed, which meant I had to adjust my initial code regularly to keep up with the BLS's evolving format. For instance, during one update, the way salary data was organized shifted, and I had to rewrite sections of my script to accommodate these changes.
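One habit that helped with those layout changes was verifying that a selector still matched before parsing, so the crawler failed with a clear message instead of silently returning an empty dataset. The sketch below illustrates the idea; the `#salaries` selector is a made-up placeholder, not the real BLS markup.

```r
library(rvest)

# Pull a salary table out of a parsed page, but stop with a helpful
# error if the expected element is no longer there (e.g. after a
# site redesign). The selector is illustrative, not the BLS's actual HTML.
extract_salary_table <- function(page, selector = "#salaries") {
  node <- html_element(page, selector)
  if (inherits(node, "xml_missing")) {
    stop("Selector '", selector,
         "' matched nothing -- the page layout may have changed.")
  }
  html_table(node)
}
```

A guard like this turns a quiet data bug into an immediate, diagnosable failure, which made keeping up with the BLS's evolving format much less painful.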


This experience taught me the importance of perseverance and flexibility. I learned that overcoming obstacles is part of the journey.


Technical Aspects of the Crawler


For those interested in the specifics of building the web crawler, here's a basic outline of the steps I took:


  1. Installation of Necessary Libraries: I began by installing key libraries in R, like `rvest` for web scraping and `dplyr` for data manipulation.


  2. Building the Crawler: Using R scripts, I programmed the crawler to navigate the BLS website, focusing on sections containing salary information.


  3. Data Extraction: I used functions such as `html_nodes()` to identify and collect relevant HTML elements with salary data, allowing for efficient data gathering.


  4. Data Cleaning and Formatting: After collecting the raw data, I applied `dplyr` functions to clean and format it for straightforward analysis.


  5. Storage: The final dataset was saved as a CSV file, making it easy for my teammates to utilize in the broader project.
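The steps above can be sketched end to end. This is a minimal, self-contained version run against a small inline HTML snippet rather than the live site; the real crawler's URLs and selectors on bls.gov are not shown here, and the table contents below are stand-in values.

```r
library(rvest)
library(dplyr)

# Stand-in for a fetched BLS page; in the real crawler this came from
# read_html() on a bls.gov URL (step 2).
page <- minimal_html('
  <table id="salaries">
    <tr><th>Occupation</th><th>Median annual wage</th></tr>
    <tr><td>Software Developers</td><td>$120,730</td></tr>
    <tr><td>Civil Engineers</td><td>$88,050</td></tr>
  </table>')

# Step 3: locate the relevant HTML element and extract the table
salaries <- page %>%
  html_element("#salaries") %>%
  html_table()

# Step 4: clean and format -- strip "$" and "," so wages are numeric
salaries <- salaries %>%
  mutate(`Median annual wage` = as.numeric(gsub("[$,]", "", `Median annual wage`)))

# Step 5: save as CSV for teammates to use
write.csv(salaries, "stem_salaries.csv", row.names = FALSE)
```

Parsing the wages into numbers in step 4 is what made the later analysis and visualization straightforward, since "$120,730" as text can't be sorted or plotted directly.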


This technical process was highly fulfilling, providing me with practical experience in web scraping and data management—skills that will undoubtedly be valuable in my future endeavors.


Leveraging Data for Empowerment


Once I had the salary data, the next step was using it to illustrate diverse career paths in STEM.


The data enabled us to create visualizations that highlighted salary variations across different STEM roles. By presenting this information engagingly, we aimed to inspire interest in young girls by showing them the tangible rewards of pursuing careers in these fields.
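As a hedged sketch of the kind of chart we built, the snippet below draws median wages by occupation as a horizontal bar chart with `ggplot2`. The figures here are placeholder values for illustration, not our actual BLS extract.

```r
library(ggplot2)

# Placeholder data standing in for the cleaned BLS extract
salaries <- data.frame(
  occupation  = c("Software Developer", "Civil Engineer", "Data Scientist"),
  median_wage = c(120730, 88050, 100910)
)

# Horizontal bars (coord_flip) keep long occupation names readable
p <- ggplot(salaries, aes(x = reorder(occupation, median_wage),
                          y = median_wage)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = NULL, y = "Median annual wage (USD)",
       title = "Median wages across selected STEM roles")

# In the app the plot was rendered for counselors; to export it:
# ggsave("stem_wages.png", p, width = 6, height = 4)
```

Sorting the bars by wage, as `reorder()` does here, makes the salary variation across roles immediately visible at a glance.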


Importance of Data-Driven Decision Making


One key lesson I took away was the power of data-driven decision-making. By using real-world statistics, we grounded our project in solid evidence.


This focus on data not only enhanced our application but also equipped teachers and counselors with the resources they need to effectively motivate young learners. For instance, showing that women in engineering earn an average salary of $90,000 can motivate a young girl considering that path.


Reflecting on the Journey


As I consider my experience building a web crawler to gather salary data from the Bureau of Labor Statistics, I feel proud of what we accomplished. This project went beyond just coding; it significantly contributed to empowering future generations and addressing gender disparity in STEM fields.


The insights gained from working with data and developing practical skills in web crawling are invaluable. I hope to keep learning about how technology can promote STEM careers for young girls.


Ultimately, this project's aim transcends mere data collection—it's about inspiring the next generation of women to recognize their potential and engage in careers they are passionate about, armed with facts, figures, and creativity.


The world of STEM is calling for sharp minds, and who knows? You might become the next innovator in this exciting field!


Let's continue to advocate for inclusion in STEM, one young girl at a time!
