An Introduction to Web Scraping for Research

Like web archiving, web scraping is a process by which you can collect data from websites and save it for further research or preserve it over time. Also like web archiving, web scraping can be done through manual selection or it can involve the automated crawling of web pages using pre-programmed scraping applications.

Unlike web archiving, which is designed to preserve the look and feel of websites, web scraping is mostly used for gathering textual data. Most web scraping tools also allow you to structure the data as you collect it. So, instead of massive unstructured text files, you can transform your scraped data into spreadsheet, CSV, or database formats that allow you to analyze and use it in your research.
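To make the idea of structuring data as you collect it concrete, here is a minimal sketch that uses only Python's standard library. The sample HTML, the class names (`post`, `author`, `text`), and the function name `scrape_to_csv` are all hypothetical and chosen for illustration; a real page will have its own markup, and a dedicated scraping tool may be a better fit for larger projects.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page. A real scraper would fetch HTML from a site
# whose terms of use (and any applicable IRB guidance) permit collection.
SAMPLE_HTML = """
<ul>
  <li class="post"><span class="author">ada</span><span class="text">First post</span></li>
  <li class="post"><span class="author">grace</span><span class="text">A reply, with a comma</span></li>
</ul>
"""


class PostParser(HTMLParser):
    """Collect one {'author', 'text'} record per <li class="post">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records, in page order
        self._field = None   # name of the <span> currently being read

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "post":
            # Start a new, empty record for this post.
            self.rows.append({"author": "", "text": ""})
        elif tag == "span" and cls in ("author", "text") and self.rows:
            self._field = cls

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field and self.rows:
            self.rows[-1][self._field] += data.strip()


def scrape_to_csv(html):
    """Return the parsed records and the same records serialized as CSV."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return parser.rows, out.getvalue()


rows, csv_text = scrape_to_csv(SAMPLE_HTML)
print(csv_text)
```

Writing through `csv.DictWriter` rather than joining strings by hand means that values containing commas or quotes are escaped correctly, so the file opens cleanly in a spreadsheet.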

There are many applications for web scraping. Companies use it for market and pricing research, weather services use it to track weather information, and real estate companies harvest data on properties. Researchers also use web scraping to study web forums and social media such as Twitter and Facebook, to gather large collections of data or documents published on the web, and to monitor changes to web pages over time. If you are interested in identifying, collecting, and preserving textual data that exists online, there is almost certainly a scraping tool that can fit your research needs.

Please be advised that if you are collecting data from web pages, forums, social media, or other web materials for research purposes and your work may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

Link Roundup November 2019

Jennifer Patiño

World Digital Preservation Day is November 7th and this year’s theme is “At-Risk Digital Materials.”

Researchers at the University of Hawaiʻi at Mānoa uncovered a glitch in a computer program that produced different results depending on the operating system, possibly affecting more than 100 published studies. A good reminder to make sure you have a detailed README file for any code you create!

Wired reports on a study in Science that revealed racial bias in a widely used algorithm that assigned lower levels of care to Black patients in U.S. hospitals. The study shows how, by focusing on healthcare costs, the algorithm replicated disparities in access, and it offers suggestions for reformulating the algorithm.

Kent Emerson

Researchers at UW-Madison’s Wisconsin Institute for Discovery, working on a project called Wisconsin Expansion of Renewable Electricity with Optimization under Long-term Forecasts (WEREWOLF), are producing mathematical models that will help policy makers make decisions about the future of Wisconsin’s renewable energy resources.

The Roy Rosenzweig Center for History and New Media at George Mason University is celebrating its 25th anniversary. During this time, the RRCHNM has produced some of the most widely used open source digital resources including Omeka, Zotero, and Tropy as well as discrete art and art history projects.

Link Roundup October 2019

In this series, members of the RDS team share links to research data-related stories, resources, and news that caught their eye each month. Feel free to share your favorite stories with us on Twitter @UWMadRschSvcs!

Cameron Cook

UW-Madison’s Information Technology Office has kindly generated 3 Tips to Manage Google Drive. These are designed to help you manage your “personal” and UW-Madison G Suite accounts.

October is National Cybersecurity Awareness Month and the Office of Cybersecurity will be hosting a series of informational events throughout the month.

Clare Michaud

In “Managing 100 Digital Humanities Projects: Digital Scholarship & Archiving in King’s Digital Lab,” the authors outline the process of managing digital humanities projects at King’s College London and stress the importance of partnerships between libraries, IT, and researchers in the creation of successful and sustainable digital projects.

Kent Emerson

In their Annual Report, “Cultivating Princeton’s Data Landscape”, the Center for Digital Humanities @Princeton reflects on their 2018-2019 “Year of Data”. Throughout the year, the CDH hosted a keynote address by Safiya Noble, held workshops for students and faculty, and served as a hub for connecting researchers, teachers, and resources.

An Introduction to Web Archiving for Research

Web archiving is the practice of collecting and preserving resources from the web. The best-known and most widely used web archive is the Internet Archive’s Wayback Machine. The Internet Archive was launched in 1996 by Brewster Kahle with the mission of providing “Universal Access to All Knowledge.” The Wayback Machine uses an automated process called crawling to collect pages from all over the web and stores them on servers at the Internet Archive headquarters in San Francisco.
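As a rough illustration of what crawling means in general, the sketch below starts from one page, saves it, extracts its links, and repeats for every page it has not yet seen. It is not the Wayback Machine's implementation: the three-page site in `PAGES`, the `example.org` URLs, and the names `LinkParser` and `crawl` are hypothetical, and the pages are held in memory so the example runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory so the sketch runs offline.
PAGES = {
    "https://example.org/": '<a href="/about">About</a> <a href="/news">News</a>',
    "https://example.org/about": '<a href="/">Home</a>',
    "https://example.org/news": (
        '<a href="/about">About</a> '
        '<a href="https://example.org/missing">Gone</a>'
    ),
}


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first crawl; `fetch` returns a page's HTML or None."""
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be visited
    archive = {}                # URL -> saved HTML
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page could not be retrieved; skip it
            continue
        archive[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


archive = crawl("https://example.org/", PAGES.get)
print(list(archive))
```

A crawler that fetches live pages would also need to respect `robots.txt`, limit its request rate, and restrict itself to the sites in its collection scope; those concerns are left out here to keep the core loop visible.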

Institutions such as government agencies, universities, and libraries also actively archive the web, but often with a narrower collection scope. There are also many web archiving projects run by smaller teams and individual researchers, and these too usually have specific areas of focus. If there are web resources you are interested in collecting and preserving, with a little research and learning of the tools, you can absolutely create your own web archive.

Please be advised that if you are archiving web pages, forums, social media, or other web materials for research purposes and your work may constitute human subjects research, you must consult with and follow the appropriate UW-Madison Institutional Review Board process as well as follow their guidelines on “Technology & New Media Research”.

Research Bazaar Call for Art Submissions

Art has an important role to play in helping the public make sense of complex data in new and exciting ways. Artists and data scientists alike, able to see the patterns in the data, translate information into visual and aesthetic forms that increase awareness and make complicated issues easier to understand. The planning committee for UW-Madison’s Data Science Research Bazaar seeks submissions of artwork influenced by data science or created by data scientists for display. We welcome submissions from both campus members and the public, and encourage submitters to think broadly about the intersections between data science and art.

The Data Science Research Bazaar Seeks Submissions

Cross-posted from the Data Science Hub

The Data Science Hub is excited to invite you to participate in the inaugural Data Science Research Bazaar by submitting your ideas to present!

UW-Madison’s Data Science Research Bazaar is a practical, two-day, hands-on, unconference-style event for all members of the UW-Madison campus community who are interested in data science, from expert methodologists to novice learners just getting their feet wet with data science tools. Presenters from all disciplines, all UW-Madison affiliated individuals, and individuals from the surrounding Madison area are encouraged to apply. Help make the Research Bazaar a successful exchange of ideas and skills by participating and submitting your idea to present. The Research Bazaar is happening at the Discovery Building on January 24-25, 2020.

Proposals are due on November 15, 2019, unless otherwise noted in a specific call.

The Research Bazaar is seeking proposals for the following presentation formats and workshops:

The Research Bazaar is also seeking proposals for the following networking opportunities:

If you have any questions, please send an email to contact@datascience.wisc.edu.