Political Stances - Data Cleaning
0. Data Collection
Author: Natalie Castro Date: 1/15/2025
The purpose of this notebook is to collect varying forms of data from different sources on the Internet to answer the research question:
What are the characteristics of self-identified political parties expressed in 2025 regarding climate change?
1. Environment Creation
1.1 Library Import
''' DATA QUERYING '''
from bs4 import BeautifulSoup
import json
import requests
from time import sleep
import pypdf
''' DATA MANAGEMENT '''
import pandas as pd
import regex as re
1.2 Secret Storage
''' NEWSAPI KEY'''
api_key = 'xxxxxxxxxxxxxxxxxxxx'
2. API Requests
The keywords used in this analysis will be “democrat+climate” and “republican+climate”.
2.1 NewsAPI
For NewsAPI, three types of data will be collected. First is the “everything” endpoint, which collects every matching article from the past five years of their corpus using a keyword. Next will be the top headlines on January 20th, President Donald Trump’s inauguration day. The final endpoint used will be “sources”, to better understand media production for each keyword around this date.
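As a rough sketch (not run in this notebook), the inauguration-day slice could be requested by adding the everything endpoint's from/to date parameters (names taken from the NewsAPI documentation) to the same kind of payload used below:
''' SKETCH: RESTRICTING A QUERY TO INAUGURATION DAY '''
inauguration_post = {'apiKey':api_key,
                     'q':'democrat+climate',
                     'language':'en',
                     'from':'2025-01-20', ## the everything endpoint accepts ISO 8601 dates
                     'to':'2025-01-20',
                     'sortBy':'popularity'}
## inauguration_response = requests.get("https://newsapi.org/v2/everything?", inauguration_post)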
2.1.1 NewsAPI: Everything Endpoint
The code shown below is how I got the API up and running; this was then iterated on with the “page” parameter to extract the entire set of results.
''' BUILDING THE URL '''
base_url = "https://newsapi.org/v2/everything?"
url_post = {'apiKey':api_key,
'source':'everything',
'q':'democrat+climate', ## this iteration will be looking at democrat referencing articles
'language':'en', ## selecting English as the data
'sortBy':'popularity', ## used to generate a popularity label
}
url_post2 = {'apiKey':api_key,
'source':'everything',
'q':'republican+climate', ## this iteration will be looking at republican referencing articles
'language':'en', ## selecting English as the data
'sortBy':'popularity', ## used to generate a popularity label
}
''' MAKING THE REQUESTS '''
democrat_response = requests.get(base_url,url_post)
republican_response = requests.get(base_url,url_post2)
''' CHECKING OUT THE RESPONSE: DEMOCRAT '''
dem_text = democrat_response.json()
dem_text['articles'][0]
{'source': {'id': None, 'name': 'CNET'},
'author': 'Katie Collins',
'title': 'For Progress on Climate and Energy in 2025, Think Local',
'description': "As Trump and his anti-science agenda head for the White House, look to America's city and state leaders to drive climate action and prioritize clean energy.",
'url': 'https://www.cnet.com/home/energy-and-utilities/for-progress-on-climate-and-energy-in-2025-think-local/',
'urlToImage': 'https://www.cnet.com/a/img/resize/5fa89cffd3d573f39b4cf70398e5bb4b3038a2d7/hub/2024/12/31/540a7c2e-63f9-445f-a311-5744bcce16a2/us-map-localized-energy-progress.jpg?auto=webp&fit=crop&height=675&width=1200',
'publishedAt': '2025-01-03T13:00:00Z',
'content': 'With its sprawling canopy of magnolia, dogwood, southern pine and oak trees, Atlanta is known as the city in the forest. The lush vegetation helps offset the pollution from the commuter traffic as pe… [+16751 chars]'}
rep_text = republican_response.json()
rep_text.keys()
dict_keys(['status', 'totalResults', 'articles'])
rep_text['totalResults']
1559
rep_text['articles'][0]
{'source': {'id': 'the-verge', 'name': 'The Verge'},
'author': 'Nilay Patel',
'title': 'Trump’s first 100 days: all the news impacting the tech industry',
'description': 'President Donald Trump is taking on TikTok, electric vehicle policy, and AI in his first 100 days in office. This time around, he has the backing of many tech billionaires.',
'url': 'https://www.theverge.com/24348851/donald-trump-presidency-tech-science-news',
'urlToImage': 'https://cdn.vox-cdn.com/thumbor/Nwo4_i4giY8lRM0Rtzih1IHTSLU=/0x0:2040x1360/1200x628/filters:focal(1020x680:1021x681)/cdn.vox-cdn.com/uploads/chorus_asset/file/25531809/STK175_DONALD_TRUMP_CVIRGINIA_C.jpg',
'publishedAt': '2025-01-22T14:30:00Z',
'content': 'Filed under:\r\nByLauren Feiner, a senior policy reporter at The Verge, covering the intersection of Silicon Valley and Capitol Hill. She spent 5 years covering tech policy at CNBC, writing about antit… [+7943 chars]'}
The structure of the response is a nested dictionary, with each entry in the articles list being a dictionary for the respective news article.
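Since each article is itself a dictionary, individual fields can be pulled out with a simple comprehension; a quick sketch using the dem_text response from above:
''' SKETCH: PULLING FIELDS OUT OF THE NESTED RESPONSE '''
source_names = [article['source']['name'] for article in dem_text['articles']] ## the source is its own nested dictionary
titles = [article['title'] for article in dem_text['articles']]
print (source_names[:3])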
Iterating for Democrat Articles
''' PAGE TURNER
INPUT: the desired page for the API call, the keyword used to build the URL
OUTPUT: a list of the articles from the page
The function page_turner is used to collect the entire corpus from the NewsAPI
for a particular keyword. This function is wrapped in a for loop so
it can build a new URL for each distinct page.
'''
base_url = "https://newsapi.org/v2/everything?"
def page_turner(page_number,keyword):
    sleep(2)
    ## Building the post URL for every page in the iteration
    url_post = {'apiKey':api_key,
                'source':'everything',
                'q':keyword, ## the keyword is either democrat+climate or republican+climate
                'language':'en', ## selecting English as the data
                'sortBy':'popularity', ## used to generate a popularity label
                'page':page_number}
    response = requests.get(base_url,url_post)
    json_ = response.json()
    #print (json_.keys())
    return(json_['articles'])
nested_responses_democrat = []
for page in range(1,7):
    page_contents = page_turner(page,'democrat+climate')
    nested_responses_democrat.append(page_contents)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[20], line 4
1 nested_responses_democrat = []
3 for page in range(1,7):
----> 4 page_contents = page_turner(page,'democrat+climate')
5 nested_responses_democrat.append(page_contents)
Cell In[13], line 29, in page_turner(page_number, keyword)
27 json_ = response.json()
28 #print (json_.keys())
---> 29 return(json_['articles'])
KeyError: 'articles'
nested_responses_republican = []
for page in range(1,17):
    page_contents = page_turner(page,'republican+climate')
    nested_responses_republican.append(page_contents)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[31], line 4
1 nested_responses_republican = []
3 for page in range(1,17):
----> 4 page_contents = page_turner(page,'republican+climate')
5 nested_responses_republican.append(page_contents)
Cell In[13], line 29, in page_turner(page_number, keyword)
27 json_ = response.json()
28 #print (json_.keys())
---> 29 return(json_['articles'])
KeyError: 'articles'
len(nested_responses_republican)
5
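Both loops stopped early with a KeyError, likely because past a certain page NewsAPI returned an error payload instead of a response containing an articles key. A more defensive sketch (not the version used above, and assuming error responses carry a message field) would surface the problem instead of crashing:
''' SKETCH: A MORE DEFENSIVE PAGE TURNER '''
def page_turner_safe(page_number,keyword):
    sleep(2)
    url_post = {'apiKey':api_key,
                'source':'everything',
                'q':keyword,
                'language':'en',
                'sortBy':'popularity',
                'page':page_number}
    response = requests.get(base_url,url_post)
    json_ = response.json()
    if 'articles' not in json_:
        ## report whatever the API sent back and hand back an empty page
        print (f"page {page_number} failed: {json_.get('message','no message returned')}")
        return ([])
    return (json_['articles'])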
2.2 Congress.Gov API
The purpose of collecting data from this API is to understand how the different political sides have represented and institutionalized their views about climate change.
GitHub Documentation
Using Congress Data Offsite
Congress API Endpoints
Python Code Examples
api_key = 'XXX'
base_url = f'https://api.congress.gov/v3/bill?api_key={api_key}'
url_post = {
'format':'json', # specifying the response format
'offset':0, ## specifying the start of the records returned,
'limit':10 ## specifying the number of records returned
}
## For government APIs, it's generally good practice to provide some sort of user agent!
user_agent = {'user-agent': 'University of Colorado at Boulder, natalie.castro@colorado.edu'}
''' TEST 2'''
test2_response = requests.get(base_url,url_post,headers=user_agent)
test2_response
<Response [200]>
2.2.1 Collecting Bill Numbers
To collect the bill numbers, I will be changing the URL for a few different parameters and scraping the congress.gov site. The filters applied are: all congresses, all legislation sources, any bill status, and the Environmental Protection subject.
The URL for the bill search (as of 1/28/2025) is:https://www.congress.gov/search?q=%7B%22congress%22%3A%22all%22%2C%22source%22%3A%22all%22%2C%22bill-status%22%3A%22all%22%2C%22subject%22%3A%22Environmental+Protection%22%7D
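That long query string is just a URL-encoded JSON object; a small sketch of rebuilding it programmatically, using the same filter keys visible in the URL above:
''' SKETCH: REBUILDING THE CONGRESS.GOV SEARCH URL '''
from urllib.parse import quote_plus
search_filters = {"congress":"all",
                  "source":"all",
                  "bill-status":"all",
                  "subject":"Environmental Protection"}
search_url = "https://www.congress.gov/search?q=" + quote_plus(json.dumps(search_filters,separators=(',',':')))
print (search_url)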
A CSV was downloaded with the bill numbers, for a total of 8,056 bills. The original downloaded CSV comes with three “metadata” lines; these consisted of the date collected and the search URL, which I have listed above. The three lines were deleted so the files could be read in using Pandas.
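Alternatively, rather than deleting the metadata lines by hand, pandas can skip them at read time; a sketch against the original, unedited download (assuming the three metadata lines sit above the header row):
''' SKETCH: SKIPPING THE METADATA LINES AT READ TIME '''
bill_information1 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_119_113.csv",
                                encoding='utf-8',skiprows=3) ## skiprows drops the three metadata lines before the header is parsed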
bill_information1 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_119_113.csv",encoding='utf-8')
bill_information2 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_112_103.csv",encoding='utf-8')
bill_information3 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_102_95.csv",encoding='utf-8')
bill_information4 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_94_93.csv",encoding='utf-8')
bill_information = pd.concat([bill_information1,bill_information2,bill_information3,bill_information4])
bill_information.head()
Legislation Number | URL | Congress | Title | Sponsor | Date of Introduction | Committees | Latest Action | Latest Action Date | Number of Cosponsors | Amends Bill | Date Offered | Date Submitted | Date Proposed | Amends Amendment | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | H.R. 375 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Continued Rapid Ohia Death Response Act of 2025 | Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... | 1/13/2025 | House - Natural Resources, Agriculture | Senat... | Received in the Senate and Read twice and refe... | 1/24/2025 | 1 | NaN | NaN | NaN | NaN | NaN |
1 | H.R. 349 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Goldie’s Act | Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... | 1/13/2025 | House - Agriculture | Referred to the House Committee on Agriculture. | 1/13/2025 | 6 | NaN | NaN | NaN | NaN | NaN |
2 | H.R. 313 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Natural Gas Tax Repeal Act | Pfluger, August [Rep.-R-TX-11] (Introduced 01/... | 1/9/2025 | House - Energy and Commerce | Referred to the House Committee on Energy and ... | 1/9/2025 | 4 | NaN | NaN | NaN | NaN | NaN |
3 | H.R. 288 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Long Island Sound Restoration and Stewardship ... | LaLota, Nick [Rep.-R-NY-1] (Introduced 01/09/2... | 1/9/2025 | House - Transportation and Infrastructure, Nat... | Referred to the Subcommittee on Water Resource... | 1/10/2025 | 4 | NaN | NaN | NaN | NaN | NaN |
4 | H.R. 284 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | GLRI Act of 2025 | Joyce, David P. [Rep.-R-OH-14] (Introduced 01/... | 1/9/2025 | House - Transportation and Infrastructure | Referred to the Subcommittee on Water Resource... | 1/10/2025 | 28 | NaN | NaN | NaN | NaN | NaN |
''' CREATING CONGRESS TITLES '''
def congress_finder(current_congress):
    ## looking for the congress number, but not including the ordinal suffix (th/st/nd/rd)
    regex_pattern = r'[0-9]{2,3}(?=[a-z]{2})'
    congress_match = re.findall(regex_pattern,current_congress)
    congress_num = int(congress_match[0])
    return(congress_num)
bill_information['Congress Number'] = bill_information['Congress'].apply(lambda x: congress_finder(x))
''' CREATING A COLUMN FOR BILL TYPE:
The structure of all of the legislation numbers is [BILL TYPE]_[BILL NUMBER]. I will be using
Regex to extract and separate these columns
'''
''' BILL TYPE CLEANER: This function is used in a lambda apply down the rows to create the
bill types needed for the congress.gov API
'''
bt_pattern = r'[A-Za-z]+\.*'
def bill_type_cleaner(bt):
    matches = re.findall(bt_pattern,bt)
    type_dirty = ''.join(matches)
    type_text = re.sub(r"\.","",type_dirty)
    type_clean = type_text.lower()
    return (type_clean)
bill_information['Bill Type'] = bill_information['Legislation Number'].apply(lambda x: bill_type_cleaner(x))
''' CREATING A COLUMN FOR BILL NUMBER '''
bt_num_pattern = r'[^A-Za-z\.]'
def bill_num_cleaner(bt):
    matches = re.findall(bt_num_pattern,bt)
    type_dirty = ''.join(matches)
    type_clean = type_dirty.lower().strip()
    return (int(type_clean)) ## The API asks for an integer
bill_information['Bill Number'] = bill_information['Legislation Number'].apply(lambda x: bill_num_cleaner(x))
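A quick spot check of the three helpers on the values from the first rows shown above (a sketch; the expected results are noted in the comments):
''' SKETCH: SPOT-CHECKING THE CLEANING HELPERS '''
print (congress_finder('119th Congress (2025-2026)')) ## expected: 119
print (bill_type_cleaner('H.R. 375')) ## expected: 'hr'
print (bill_num_cleaner('H.R. 375')) ## expected: 375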
''' CREATING A COLUMN FOR SPONSOR AFFILIATION & CREATING A COLUMN FOR SPONSOR STATE '''
affiliation_pattern = r'-[DRI]'
state_pattern = r'-[A-Z]{2}'
def affiliation_finder(sponsor):
    ## For party affiliation
    match = re.findall(affiliation_pattern,sponsor)
    clean_affiliation = re.sub("-","",match[0])
    return (clean_affiliation)
def state_finder(sponsor):
    ## For State affiliation
    state_match = re.findall(state_pattern,sponsor)
    clean_state = re.sub("-",'',state_match[0])
    return (clean_state)
bill_information['Sponser Affiliation'] = bill_information['Sponsor'].apply(lambda x: affiliation_finder(x))
bill_information['Sponser State'] = bill_information['Sponsor'].apply(lambda x: state_finder(x))
bill_information.head(2)
Legislation Number | URL | Congress | Title | Sponsor | Date of Introduction | Committees | Latest Action | Latest Action Date | Number of Cosponsors | Amends Bill | Date Offered | Date Submitted | Date Proposed | Amends Amendment | Congress Number | Bill Type | Bill Number | Sponser Affiliation | Sponser State | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | H.R. 375 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Continued Rapid Ohia Death Response Act of 2025 | Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... | 1/13/2025 | House - Natural Resources, Agriculture | Senat... | Received in the Senate and Read twice and refe... | 1/24/2025 | 1 | NaN | NaN | NaN | NaN | NaN | 119 | hr | 375 | D | HI |
1 | H.R. 349 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Goldie’s Act | Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... | 1/13/2025 | House - Agriculture | Referred to the House Committee on Agriculture. | 1/13/2025 | 6 | NaN | NaN | NaN | NaN | NaN | 119 | hr | 349 | R | NY
bill_information.reset_index(inplace=True)
bill_information.drop(columns='index',inplace=True)
2.2.2 Collecting Bill Text URLs to Scrape from the Congress API
''' BUILDING A FUNCTION FOR THE URLS '''
def url_builder(congress,bill_type,bill_number):
    base_url = "https://api.congress.gov/v3/bill/"
    request_url = base_url + str(congress) + "/" + bill_type + "/" + str(bill_number) + '/text?api_key=' + api_key
    return (request_url)
user_agent = {'user-agent': 'University of Colorado at Boulder, natalie.castro@colorado.edu'}
''' BUILDING A FUNCTION FOR THE API REQUESTS TO GATHER THE DATA '''
def xml_link_collector(url):
    ## Making a request with our URL
    response = requests.get(url,headers=user_agent)
    ## Making sure the response was valid
    try:
        out = response.json()
        collected_url = out['textVersions'][0]['formats'][2]['url'] ## This index path was determined by parsing the output for a test example
        return (collected_url)
    except:
        print (f"⚠️ uh oh! there was an error when using the API with this url:{url}\n ")
        return ("NO URL FOUND")
url = url_builder(119,'hjres',30)
xml_link_collector(url)
'https://www.congress.gov/119/bills/hjres30/BILLS-119hjres30ih.xml'
Collecting the Bill Information:
This will take place in two parts, because the rate limit for the Congress API is 5,000 requests per hour.
''' SCRAPING XMLS USING THE API -- [PART 1]'''
url_list = []
for row in range(0,len(bill_information[0:4998])):
    congress = bill_information.at[row,'Congress Number']
    bill_type = bill_information.at[row,'Bill Type']
    bill_number = bill_information.at[row,'Bill Number']
    current_url = url_builder(congress,bill_type,bill_number)
    ## Making sure the API connection is well rested (i.e., avoiding the rate limit)
    if row % 100 == 0:
        sleep(5)
    xml_found = xml_link_collector(current_url)
    url_list.append(xml_found)
The output from the above cell was removed because there were a lot of errors, and when converting to HTML I did not want to bog down the page! The errors were the “uh oh!” warnings printed by xml_link_collector whenever a bill's text could not be retrieved.
print (len(url_list))
4998
url_list[3249]
'https://www.congress.gov/109/bills/s2920/BILLS-109s2920is.xml'
url_list = url_list[:3250]
url_list[3249:3250]
['https://www.congress.gov/109/bills/s2920/BILLS-109s2920is.xml']
saving_urls = pd.DataFrame(url_list)
saving_urls.to_csv("Found URLs for Bills 0 - 3250.csv")
''' SCRAPING XMLS USING THE API -- [PART 2] '''
for row in range(3250,len(bill_information[3250:])):
    congress = bill_information.at[row,'Congress Number']
    bill_type = bill_information.at[row,'Bill Type']
    bill_number = bill_information.at[row,'Bill Number']
    current_url = url_builder(congress,bill_type,bill_number)
    ## Making sure the API connection is well rested (i.e., avoiding the rate limit)
    if row % 100 == 0:
        sleep(5)
    xml_found = xml_link_collector(current_url)
    url_list.append(xml_found)
## Testing that all went well...
print (f"The expected length of the URL list should be {len(bill_information)}.\nThe actual length of the URL list is {len(url_list)}.")
The expected length of the URL list should be 8056.
The actual length of the URL list is 4806.
Uh-Oh! Two things are going on here. First, the second loop's range only covered rows 3,250 through 4,805, because len(bill_information[3250:]) is 4,806 rather than 8,056. Second, as the errors above showed, many of the older bills cannot be retrieved through the API at all.
To work with what was collected, I am going to deconstruct the URLs that were found and then only keep the information for those bills.
At this stage, I am filtering to find which bills do have available URLs.
urls_raw = pd.DataFrame(url_list)
urls_raw.rename(columns={0:"API URL"},inplace=True)
congress_pattern = '(?<=/)[0-9]{2,3}'
bt_pattern = '(?<=s/)[h,c,r,o,n,e,s,j,r]{1,7}'
bill_num_pattern = '(?<=[a-z])\d{1,4}(?!=\/BILL)'
url_congress = []
url_bt = []
url_num = []
for row in range(0,len(urls_raw)):
    curr_url = urls_raw.at[row,'API URL']
    congress = re.findall(congress_pattern,curr_url)
    bill_type = re.findall(bt_pattern,curr_url)
    bill_num = re.findall(bill_num_pattern,curr_url)
    if len(congress) > 0:
        if len(bill_type) > 0:
            if len(bill_num) > 0:
                url_congress.append(int(congress[0]))
                url_bt.append(bill_type[0])
                url_num.append(int(bill_num[0]))
            else:
                url_num.append('DROP')
        else:
            url_bt.append('DROP')
    else:
        url_congress.append('DROP')
        url_bt.append('DROP')
        url_num.append('DROP')
urls_raw['Congress Number'] = url_congress
urls_raw['Bill Type'] = url_bt
urls_raw['Bill Number'] = url_num
## Cleaning the URLs
urls_raw.drop_duplicates(inplace=True)
## And dropping if there was any error in the regex find all
congress_condition = urls_raw['Congress Number'] != 'DROP'
bt_condition = urls_raw['Bill Type'] != 'DROP'
bn_condition = urls_raw['Bill Number'] != 'DROP'
urls_raw1 = urls_raw[congress_condition]
urls_raw2 = urls_raw1[bt_condition]
urls_clean = urls_raw2[bn_condition]
C:\Users\natal\AppData\Local\Temp\ipykernel_13272\878941524.py:7: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
urls_raw2 = urls_raw1[bt_condition]
C:\Users\natal\AppData\Local\Temp\ipykernel_13272\878941524.py:8: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
urls_clean = urls_raw2[bn_condition]
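Those warnings appear because each boolean mask was built against urls_raw's index but then applied to an already-filtered frame. A sketch of the same 'DROP' filtering with one combined mask, which avoids the reindexing:
''' SKETCH: FILTERING WITH ONE COMBINED MASK '''
keep_mask = ((urls_raw['Congress Number'] != 'DROP') &
             (urls_raw['Bill Type'] != 'DROP') &
             (urls_raw['Bill Number'] != 'DROP'))
urls_clean = urls_raw[keep_mask].reset_index(drop=True)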
## Now merging back with the Bill Information
bill_information_final = pd.merge(left=urls_clean,right=bill_information, on=['Congress Number','Bill Type','Bill Number'],validate='1:1')
len(bill_information_final)
3262
## Now Saving the Supplemented DataFrame
bill_information_final.to_csv("Bill Information Supplemented.csv")
url_list = bill_information_final['API URL'].to_list()
bill_information_final.head(2)
API URL | Congress Number | Bill Type | Bill Number | Legislation Number | URL | Congress | Title | Sponsor | Date of Introduction | ... | Latest Action | Latest Action Date | Number of Cosponsors | Amends Bill | Date Offered | Date Submitted | Date Proposed | Amends Amendment | Sponser Affiliation | Sponser State | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.congress.gov/119/bills/hr375/BILLS... | 119 | hr | 375 | H.R. 375 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Continued Rapid Ohia Death Response Act of 2025 | Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... | 1/13/2025 | ... | Received in the Senate and Read twice and refe... | 1/24/2025 | 1 | NaN | NaN | NaN | NaN | NaN | D | HI |
1 | https://www.congress.gov/119/bills/hr349/BILLS... | 119 | hr | 349 | H.R. 349 | https://www.congress.gov/bill/119th-congress/h... | 119th Congress (2025-2026) | Goldie’s Act | Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... | 1/13/2025 | ... | Referred to the House Committee on Agriculture. | 1/13/2025 | 6 | NaN | NaN | NaN | NaN | NaN | R | NY
2 rows × 21 columns
2.2.3 Scraping XML URLs
To get the text for each bill, each XML URL will be scraped and the extracted text appended to the dataframe we are working with.
''' XML-PARSER:
In the XML output, I am interested in the Bill Title and Text. Although the title is already in the Bill Information,
I just want to make sure everything is correct! This parser takes an XML URL as input, makes the request, parses
the response with Beautiful Soup, and returns the title and text.
No preprocessing will occur at this stage, and the raw text will just be appended as a column to the Bill Information.
Later on, this will make it easier to generate the labels.
'''
def xml_searcher(xml_url):
    xml_output = requests.get(xml_url)
    raw_xml = xml_output.text
    ## Using Beautiful Soup as a parser
    xml_soup = BeautifulSoup(raw_xml,'xml')
    ## Parsing for the title
    current_title = xml_soup.title
    ## Parsing for the text
    current_text = xml_soup.text
    return (current_title,current_text)
test_title ,test_text = xml_searcher(url_list[400])
test_title
<dc:title>118 S2959 RS: Brownfields Reauthorization Act of 2023</dc:title>
test_text
'\n\n118 S2959 RS: Brownfields Reauthorization Act of 2023\nU.S. Senate\n2023-09-27\ntext/xml\nEN\nPursuant to Title 17 Section 105 of the United States Code, this file is not subject to copyright protection and is in the public domain.\n\n\n\nIICalendar No. 214118th CONGRESS1st SessionS. 2959IN THE SENATE OF THE UNITED STATESSeptember 27 (legislative day, September 22), 2023Mr. Carper, from the Committee on Environment and Public Works, reported the following original bill; which was read twice and placed on the calendarA BILLTo amend the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 to reauthorize brownfields revitalization funding, and for other purposes.1.Short titleThis Act may be cited as the Brownfields Reauthorization Act of 2023.2.Improving small and disadvantaged community access to grant opportunitiesSection 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)) is amended—(1)in paragraph (1)(I), by inserting or 501(c)(6) after section 501(c)(3);(2)in paragraph (5), by striking subparagraph (E);(3)in paragraph (6)(C), by striking clause (ix) and inserting the following:(ix)The extent to which the applicant has a plan—(I)to engage a diverse set of local groups and organizations that effectively represent the views of the local community that will be directly affected by the proposed brownfield project; and(II)to meaningfully involve the local community described in subclause (I) in making decisions relating to the proposed brownfield project.;(4)in paragraph (10)(B)(iii)—(A)by striking 20 percent and inserting 10 percent;(B)by inserting the eligible entity is located in a small community or disadvantaged area (as those terms are defined in section 128(a)(1)(B)(iv)) or after unless; and(C)by inserting , in which case the Administrator shall waive the matching share requirement under this clause before ; and; and(5)in paragraph (13), by striking 2019 through 2023 and inserting 2024 through 2029.3.Increasing grant amountsSection 104(k)(3)(A)(ii) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)(3)(A)(ii)) is amended by striking $500,000 and all that follows through the period at the end and inserting $1,000,000 for each site to be remediated..4.State response programsSection 128(a) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9628(a)) is amended—(1)in paragraph (1)(B)(i), by striking or enhance and inserting , enhance, or implement; and(2)by striking paragraph (3) and inserting the following:(3)Authorization of appropriationsThere are authorized to be appropriated to carry out this subsection—(A)$50,000,000 for fiscal year 2024;(B)$55,000,000 for fiscal year 2025;(C)$60,000,000 for fiscal year 2026;(D)$65,000,000 for fiscal year 2027;(E)$70,000,000 for fiscal year 2028; and(F)$75,000,000 for fiscal year 2029..5.Report to identify opportunities to streamline application process; updating guidance(a)ReportNot later than 1 year after the date of enactment of this Act, the Administrator of the Environmental Protection Agency (referred to in this section as the Administrator) shall submit to Congress a report that evaluates the application ranking criteria and approval process for grants and loans under section 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)), which shall include, with respect to those grants and loans—(1)an evaluation of the shortcomings in the existing application requirements that are a recurring source of confusion for potential recipients of those grants or loans;(2)an identification of the most common sources of point deductions on application reviews;(3)strategies to incentivize the submission of applications from small communities and disadvantaged areas (as those terms are defined in section 128(a)(1)(B)(iv) of that Act (42 U.S.C. 9628(a)(1)(B)(iv)); and(4)recommendations, if any, to Congress on suggested legislative changes to the ranking criteria that would achieve the goal of streamlining the application process for small communities and disadvantaged areas (as so defined).(b)Updating guidanceNot later than 1 year after the date of enactment of this Act, the Administrator shall update the guidance relating to the application ranking criteria and approval process for grants and loans under section 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)) to reduce the complexity of the application process while ensuring competitive integrity.6.Brownfield revitalization funding for Alaska Native tribesSection 104(k)(1) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)(1)) is amended—(1)in subparagraph (G), by striking other than in Alaska; and(2)by striking subparagraph (H) and inserting the following:(H)a Regional Corporation or a Village Corporation (as those terms are defined in section 3 of the Alaska Native Claims Settlement Act (43 U.S.C. 1602));.September 27 (legislative day, September 22), 2023Read twice and placed on the calendar'
''' APPLYING THE XML PARSER '''
title_list = []
text_list = []
for url in url_list:
    try:
        title, text = xml_searcher(url)
        title_list.append(title)
        text_list.append(text)
    except:
        title_list.append("ERROR")
        text_list.append("ERROR")
## Checking that everything went smoothly
print (len(title_list))
print (len(text_list))
3262
3262
bill_information_final['Bill Title (XML)'] = title_list
bill_information_final['Bill Text'] = text_list
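Before saving, a quick count of how many bills fell into the except branch above (a sketch using the 'ERROR' sentinel from that loop):
''' SKETCH: COUNTING FAILED SCRAPES '''
print ((bill_information_final['Bill Text'] == 'ERROR').sum())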
## Saving just in case :)
bill_information_final.to_csv("Bill Information Supplemented.csv")
3. Data Structuring
At this stage, the data will not be cleaned; that will be done in a subsequent notebook. The data is structured as returned by the API query or web scrape and is not otherwise altered.
3.1 News API
3.1.1 NewsAPI: Everything
''' DATAFRAME GENERATION - DEMOCRAT '''
concatenate_democrats = []
for dataset in nested_responses_democrat:
    current_dataframe = pd.DataFrame(dataset)
    concatenate_democrats.append(current_dataframe)
democrat_articles = pd.concat(concatenate_democrats)
print (len(democrat_articles))
444
democrat_articles.head(2)
source | author | title | description | url | urlToImage | publishedAt | content | |
---|---|---|---|---|---|---|---|---|
0 | {'id': None, 'name': 'CNET'} | Katie Collins | For Progress on Climate and Energy in 2025, Th... | As Trump and his anti-science agenda head for ... | https://www.cnet.com/home/energy-and-utilities... | https://www.cnet.com/a/img/resize/5fa89cffd3d5... | 2025-01-03T13:00:00Z | With its sprawling canopy of magnolia, dogwood... |
1 | {'id': 'time', 'name': 'Time'} | Will Weissert and Chris Megerian / AP | Trump to Visit Disaster-Stricken California an... | President Trump is heading to hurricane-batter... | https://time.com/7209700/trump-los-angeles-wil... | https://api.time.com/wp-content/uploads/2025/0... | 2025-01-24T06:30:00Z | WASHINGTON President Donald Trump is heading t... |
democrat_articles.describe()
source | author | title | description | url | urlToImage | publishedAt | content | |
---|---|---|---|---|---|---|---|---|
count | 444 | 402 | 444 | 443 | 444 | 319 | 444 | 444 |
unique | 119 | 285 | 421 | 429 | 444 | 315 | 424 | 441 |
top | {'id': None, 'name': 'Freerepublic.com'} | Breitbart | Democrat Sen. Markey: L.A. Fires Are “Climate ... | Democrat Massachusetts Sen. Ed Markey has clai... | https://www.cnet.com/home/energy-and-utilities... | https://static.dw.com/image/71400751_6.jpg | 2025-01-25T05:00:00Z | In a confirmation hearing on Thursday, Democra... |
freq | 119 | 14 | 4 | 4 | 1 | 3 | 3 | 2 |
''' DATAFRAME GENERATION - REPUBLICAN '''
concatenate_republicans = []
for dataset in nested_responses_republican:
    current_dataframe = pd.DataFrame(dataset)
    concatenate_republicans.append(current_dataframe)
republican_articles = pd.concat(concatenate_republicans)
republican_articles.head(2)
source | author | title | description | url | urlToImage | publishedAt | content | |
---|---|---|---|---|---|---|---|---|
0 | {'id': 'the-verge', 'name': 'The Verge'} | Nilay Patel | Trump’s first 100 days: all the news impacting... | President Donald Trump is taking on TikTok, el... | https://www.theverge.com/24348851/donald-trump... | https://cdn.vox-cdn.com/thumbor/Nwo4_i4giY8lRM... | 2025-01-22T14:30:00Z | Filed under:\r\nByLauren Feiner, a senior poli... |
1 | {'id': None, 'name': 'Gizmodo.com'} | Kate Yoder, Grist | The Quiet Death of Biden’s Climate Corps—and W... | Biden's green jobs program was never what it s... | https://gizmodo.com/the-quiet-death-of-bidens-... | https://gizmodo.com/app/uploads/2025/01/Americ... | 2025-01-18T15:00:26Z | Giorgio Zampaglione loved his two-hour commute... |
republican_articles.describe()
source | author | title | description | url | urlToImage | publishedAt | content | |
---|---|---|---|---|---|---|---|---|
count | 376 | 350 | 376 | 376 | 376 | 373 | 376 | 376 |
unique | 93 | 267 | 375 | 365 | 376 | 369 | 371 | 370 |
top | {'id': None, 'name': 'Forbes'} | ABC News | US to withdraw from Paris agreement, expand dr... | Organizations including Walmart, Lowe’s and Me... | https://www.theverge.com/24348851/donald-trump... | https://imageio.forbes.com/specials-images/ima... | 2025-01-16T10:00:00Z | <ul><li>Trump endorses House Speaker Mike John... |
freq | 29 | 18 | 2 | 3 | 1 | 3 | 2 | 4 |
''' DATA STORAGE
This is saving all of the raw data (minimal structuring) to their respective CSV files.
'''
## Everything endpoint
republican_articles.to_csv("NEWSAPI - republican climate articles raw.csv")
democrat_articles.to_csv("NEWSAPI - democrat climate articles raw.csv")
3.1.2 NewsAPI sources
The source of each article will be pulled out along with its author. This is to better understand who is representing and providing narrative for each party.
democrat_sources = democrat_articles[['source','author']]
republican_sources = republican_articles[['source','author']]
len(democrat_sources)
444
democrat_sources.head(2)
source | author | |
---|---|---|
0 | {'id': None, 'name': 'CNET'} | Katie Collins |
1 | {'id': 'time', 'name': 'Time'} | Will Weissert and Chris Megerian / AP |
republican_sources.head(2)
source | author | |
---|---|---|
0 | {'id': 'the-verge', 'name': 'The Verge'} | Nilay Patel |
1 | {'id': None, 'name': 'Gizmodo.com'} | Kate Yoder, Grist |
''' SOURCE CLEANER:
INPUT: a list of sources in a dictionary structure with the key 'name'
OUTPUT: two lists: the first contains the entire list of sources
cleaned, and the second is only the unique sources
The purpose of this function is to clean the results from the everything
endpoint of the NewsAPI. The result is a list (which can be appended
to a new dataframe) to match the authors with clean sources. The second
return of the function is a unique list of sources.
'''
def source_cleaner(source_list):
    ## Creating a storage container for the cleaned sources
    cleaned_sources = []
    ## Iterating through each source in the provided list
    for source in source_list:
        ## Obtaining the name + storing it
        current_source = source['name']
        cleaned_sources.append(current_source)
    ## Finding the Unique Sources from the list
    unique_sources = list(set(cleaned_sources))
    return (cleaned_sources,unique_sources)
dem_sources_full = democrat_sources['source'].to_list().copy()
rep_sources_full = republican_sources['source'].to_list().copy()
dem_cleaned_sources,dem_unique_sources = source_cleaner(dem_sources_full)
rep_cleaned_sources,rep_unique_sources = source_cleaner(rep_sources_full)
democrat_sources['source'] = dem_cleaned_sources
republican_sources['source'] = rep_cleaned_sources
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\2483140677.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
democrat_sources['source'] = dem_cleaned_sources
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\2483140677.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
republican_sources['source'] = rep_cleaned_sources
dem_list = []
for i in range(0,len(democrat_sources)):
    dem_list.append("Democrat")
democrat_sources['Party'] = dem_list
rep_list = []
for i in range(0,len(republican_sources)):
    rep_list.append("Republican")
republican_sources['Party'] = rep_list
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\654442435.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
democrat_sources['Party'] = dem_list
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\654442435.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
republican_sources['Party'] = rep_list
democrat_sources.fillna('No Author',inplace=True)
republican_sources.fillna('No Author',inplace=True)
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\1565605381.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
democrat_sources.fillna('No Author',inplace=True)
C:\Users\natal\AppData\Local\Temp\ipykernel_22124\1565605381.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
republican_sources.fillna('No Author',inplace=True)
democrat_sources
source | author | Party | |
---|---|---|---|
0 | CNET | Katie Collins | Democrat |
1 | Time | Will Weissert and Chris Megerian / AP | Democrat |
2 | Politicopro.com | Blanca Begert, Camille von Kaenel, Thomas Fran... | Democrat |
3 | Scientific American | Tanya Lewis | Democrat |
4 | Vox | Benji Jones | Democrat |
... | ... | ... | ... |
95 | PBS | Will Weissert, Associated Press, Amelia Thomso... | Democrat |
96 | PBS | Michelle L. Price, Associated Press | Democrat |
97 | PBS | Bernard McGhee, Associated Press | Democrat |
98 | The Times of India | Navtej Sarna | Democrat |
99 | The Times of India | AP | Democrat |
444 rows × 3 columns
republican_sources
source | author | Party | |
---|---|---|---|
0 | The Verge | Nilay Patel | Republican |
1 | Gizmodo.com | Kate Yoder, Grist | Republican |
2 | BBC News | No Author | Republican |
3 | BBC News | No Author | Republican |
4 | BBC News | No Author | Republican |
... | ... | ... | ... |
75 | MSNBC | Jasen Castillo, John Schuessler, Miranda Priebe | Republican |
76 | MSNBC | Jen Psaki | Republican |
77 | Themorningnews.org | The Morning News | Republican |
78 | Finextra | Editorial Team | Republican |
79 | Japan Today | No Author | Republican |
376 rows × 3 columns
all_sources = pd.concat([democrat_sources,republican_sources])
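With the two frames stacked, a quick look at which outlets appear most often for each party (a sketch; nothing here is saved):
''' SKETCH: MOST FREQUENT SOURCES PER PARTY '''
source_counts = all_sources.groupby('Party')['source'].value_counts()
print (source_counts.groupby(level='Party').head(5)) ## top five sources for each party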
4. Supplementary Media
The supplementary media collected in this section consists of two PDF files, one from each of the respective parties. The party platform is the document each party produces to proclaim its goals and resolutions should it take office.
dem_pdf= pypdf.PdfReader(r"C:\Users\natal\OneDrive\university\info 5653\data\2024_democratic_party_platform.pdf",strict=True)
rep_pdf = pypdf.PdfReader(r"C:\Users\natal\OneDrive\university\info 5653\data\2024_republican_party_platform.pdf")
''' EXPLORING THE METADATA'''
dem_pdf.metadata
{'/Title': 'FINAL MASTER PLATFORM',
'/Producer': 'Skia/PDF m129 Google Docs Renderer'}
dem_pdf.decode_permissions
<bound method PdfReader.decode_permissions of <pypdf._reader.PdfReader object at 0x00000206AB5438B0>>
len(dem_pdf.pages)
92
rep_pdf.metadata
{'/CreationDate': "D:20240710083033-05'00'",
'/Creator': 'Adobe InDesign 19.4 (Macintosh)',
'/ModDate': "D:20240710083036-05'00'",
'/Producer': 'Adobe PDF Library 17.0',
'/Trapped': '/False'}
len(rep_pdf.pages)
28
4.1 Extracting Text
''' A SIMPLE TEXT EXTRACTION '''
page1 = rep_pdf.pages[0]
print(page1.extract_text())
4343RDRD REPUBLICAN NATIONAL CONVENTION REPUBLICAN NATIONAL CONVENTION
PLATFORMTHE 2024 REPUBLICAN
MAKE AMERICA GREAT AGAIN!
''' EXTRACTING TEXT FROM DEMOCRAT DOCUMENT'''
democrat_party_platform = []
for page in range(0,len(dem_pdf.pages)):
    current_page = dem_pdf.pages[page]
    current_text = current_page.extract_text()
    democrat_party_platform.append(current_text)
''' EXTRACTING TEXT FROM REPUBLICAN DOCUMENT'''
republican_party_platform = []
for page in range(0,len(rep_pdf.pages)):
    current_page = rep_pdf.pages[page]
    current_text = current_page.extract_text()
    republican_party_platform.append(current_text)
''' COMBINING INTO ONE TEXT PER PARTY '''
democrat_party_platform_all = ' '.join(democrat_party_platform)
republican_party_platform_all = ' '.join(republican_party_platform)
''' SAVING THE RAW TEXTS AS .TXTS'''
with open("democrat_party_platform.txt", "w",errors='replace') as file:
    file.write(democrat_party_platform_all)
with open("republican_party_platform.txt", "w",errors='replace') as file:
    file.write(republican_party_platform_all)