0. Data Collection


Author: Natalie Castro Date: 1/15/2025

The purpose of this notebook is to collect varying forms of data from different sources on the Internet to answer the research question:

What are the characteristics of self-identified political parties' views on climate change, as expressed in 2025?

🌐1. Environment Creation

1.1 Library Import

''' DATA QUERYING '''
from bs4 import BeautifulSoup
import json
import requests
from time import sleep
import pypdf

''' DATA MANAGEMENT '''
import pandas as pd
import regex as re
C:\Users\natal\miniconda3\lib\site-packages\pypdf\_crypt_providers\_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from this module in 48.0.0.
  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4

1.2 Secret Storage

''' NEWSAPI KEY'''
api_key = 'xxxxxxxxxxxxxxxxxxxx'
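Since the real key is redacted above, it is worth noting a safer pattern: reading the key from an environment variable so it never lands in the notebook at all. A minimal sketch, assuming the key was exported under the illustrative name NEWSAPI_KEY:

''' SAFER KEY STORAGE (SKETCH) '''
import os

## Pulling the key from the environment instead of pasting it into the notebook;
## NEWSAPI_KEY is an arbitrary variable name chosen for this sketch
api_key = os.environ['NEWSAPI_KEY']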

📡2. API Requests

The keywords used in this analysis will be "democrat+climate" and "republican+climate".

2.1 NewsAPI

For NewsAPI, three types of data will be collected. First, the 'everything' endpoint, which collects every article from the past five years in their corpora matching a keyword. Next will be the top headlines on January 20th, President Donald Trump's inauguration day. The final endpoint used will be 'sources', to better understand media production for each keyword on this date.
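As a sketch of how the inauguration-day pull could be scoped, the 'everything' endpoint accepts 'from' and 'to' date parameters (the top-headlines endpoint does not, to my knowledge, accept a date range); the parameters below are illustrative:

''' SKETCH: SCOPING A QUERY TO INAUGURATION DAY '''
inauguration_post = {'apiKey':api_key,
            'q':'democrat+climate',
            'language':'en',
            'from':'2025-01-20', ## the 'everything' endpoint accepts ISO-formatted dates
            'to':'2025-01-20'}

inauguration_response = requests.get("https://newsapi.org/v2/everything?",inauguration_post)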

2.1.1 NewsAPI: Everything Endpoint

The code shown below is how I got the API up and running; this was then iterated upon with the 'page' parameter to extract the entire set of results.

''' BUILDING THE URL '''
base_url = "https://newsapi.org/v2/everything?"

url_post = {'apiKey':api_key,
            'source':'everything',
            'q':'democrat+climate', ## this iteration will be looking at democrat referencing articles
            'language':'en', ## selecting English as the data
            'sortBy':'popularity', ## used to generate a popularity label
}

url_post2 = {'apiKey':api_key,
            'source':'everything',
            'q':'republican+climate', ## this iteration will be looking at republican referencing articles
            'language':'en', ## selecting English as the data
            'sortBy':'popularity', ## used to generate a popularity label
}
''' MAKING THE REQUESTS '''

democrat_response = requests.get(base_url,url_post)
republican_response = requests.get(base_url,url_post2)
''' CHECKING OUT THE RESPONSE: DEMOCRAT '''
dem_text = democrat_response.json()
dem_text['articles'][0]
{'source': {'id': None, 'name': 'CNET'},
 'author': 'Katie Collins',
 'title': 'For Progress on Climate and Energy in 2025, Think Local',
 'description': "As Trump and his anti-science agenda head for the White House, look to America's city and state leaders to drive climate action and prioritize clean energy.",
 'url': 'https://www.cnet.com/home/energy-and-utilities/for-progress-on-climate-and-energy-in-2025-think-local/',
 'urlToImage': 'https://www.cnet.com/a/img/resize/5fa89cffd3d573f39b4cf70398e5bb4b3038a2d7/hub/2024/12/31/540a7c2e-63f9-445f-a311-5744bcce16a2/us-map-localized-energy-progress.jpg?auto=webp&fit=crop&height=675&width=1200',
 'publishedAt': '2025-01-03T13:00:00Z',
 'content': 'With its sprawling canopy of magnolia, dogwood, southern pine and oak trees, Atlanta is known as the city in the forest. The lush vegetation helps offset the pollution from the commuter traffic as pe… [+16751 chars]'}
rep_text = republican_response.json()
rep_text.keys()
dict_keys(['status', 'totalResults', 'articles'])
rep_text['totalResults']
1559
rep_text['articles'][0]
{'source': {'id': 'the-verge', 'name': 'The Verge'},
 'author': 'Nilay Patel',
 'title': 'Trump’s first 100 days: all the news impacting the tech industry',
 'description': 'President Donald Trump is taking on TikTok, electric vehicle policy, and AI in his first 100 days in office. This time around, he has the backing of many tech billionaires.',
 'url': 'https://www.theverge.com/24348851/donald-trump-presidency-tech-science-news',
 'urlToImage': 'https://cdn.vox-cdn.com/thumbor/Nwo4_i4giY8lRM0Rtzih1IHTSLU=/0x0:2040x1360/1200x628/filters:focal(1020x680:1021x681)/cdn.vox-cdn.com/uploads/chorus_asset/file/25531809/STK175_DONALD_TRUMP_CVIRGINIA_C.jpg',
 'publishedAt': '2025-01-22T14:30:00Z',
 'content': 'Filed under:\r\nByLauren Feiner, a senior policy reporter at The Verge, covering the intersection of Silicon Valley and Capitol Hill. She spent 5 years covering tech policy at CNBC, writing about antit… [+7943 chars]'}

The structure of the response is a nested dictionary, with each list entry in the response a dictionary for the respective news article.
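For instance, the nested 'source' field can be reached by chaining keys; a quick sketch using the dem_text response from above:

''' PEEKING AT THE NESTED STRUCTURE '''
## 'articles' is a list of dictionaries, and each article's 'source' is itself a dictionary
dem_source_names = [article['source']['name'] for article in dem_text['articles']]
print (dem_source_names[:5])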

šŸ«Iterating for Democrat Articles

''' PAGE TURNER 

    INPUT: the desired page for the API call, the keyword used to build the URL
    OUTPUT: a list of the articles from the page
    
    The function page_turner is used to collect the entire corpus from the NEWSAPI
    for a particular keyword. This function is wrapped in a for loop so 
    it can build a new url for each distinct page.

'''

base_url = "https://newsapi.org/v2/everything?"


def page_turner(page_number,keyword):
    sleep(2)
    ## Building the post URL for every page in the iteration
    url_post = {'apiKey':api_key,
            'source':'everything',
            'q':keyword, ## the keyword is a parameter so the same function serves both parties
            'language':'en', ## selecting English as the data
            'sortBy':'popularity', ## used to generate a popularity label
            'page':page_number}
    
    
    response = requests.get(base_url,url_post)
    json_ = response.json()
    #print (json_.keys())
    return(json_['articles'])
nested_responses_democrat = []

for page in range(1,7):
    page_contents = page_turner(page,'democrat+climate')
    nested_responses_democrat.append(page_contents)
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Cell In[20], line 4
      1 nested_responses_democrat = []
      3 for page in range(1,7):
----> 4     page_contents = page_turner(page,'democrat+climate')
      5     nested_responses_democrat.append(page_contents)


Cell In[13], line 29, in page_turner(page_number, keyword)
     27 json_ = response.json()
     28 #print (json_.keys())
---> 29 return(json_['articles'])


KeyError: 'articles'
nested_responses_republican = []

for page in range(1,17):
    page_contents = page_turner(page,'republican+climate')
    nested_responses_republican.append(page_contents)
---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

Cell In[31], line 4
      1 nested_responses_republican = []
      3 for page in range(1,17):
----> 4     page_contents = page_turner(page,'republican+climate')
      5     nested_responses_republican.append(page_contents)


Cell In[13], line 29, in page_turner(page_number, keyword)
     27 json_ = response.json()
     28 #print (json_.keys())
---> 29 return(json_['articles'])


KeyError: 'articles'
len(nested_responses_republican)
5
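The KeyErrors above happen because NewsAPI swaps the 'articles' key for an error payload (with 'status' and 'message' fields) once the request limit is reached, so page_turner crashes mid-loop. A defensive variant, sketched here rather than re-run:

''' SKETCH: A PAGE TURNER THAT SURVIVES API ERRORS '''
def safe_page_turner(page_number,keyword):
    sleep(2)
    url_post = {'apiKey':api_key,
            'q':keyword,
            'language':'en',
            'sortBy':'popularity',
            'page':page_number}
    response = requests.get(base_url,url_post)
    json_ = response.json()
    ## NewsAPI reports failures through the 'status' field rather than raising
    if json_.get('status') != 'ok':
        print (f"stopping at page {page_number}: {json_.get('message')}")
        return (None)
    return (json_['articles'])

The calling loop can then break as soon as safe_page_turner returns None, keeping whatever pages were already collected.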

🦅 2.2 Congress.Gov API

The purpose of collecting data from this API is to understand how the different political sides have represented and institutionalized their views about climate change.

GitHub Documentation
Using Congress Data Offsite
Congress API Endpoints
Python Code Examples

api_key = 'XXX' ## the congress.gov key, redacted here
base_url = 'https://api.congress.gov/v3/bill'

url_post = {
            'api_key':api_key, ## passing the key as a request parameter instead of hardcoding it in the URL
            'format':'json', # specifying the response format
            'offset':0, ## specifying the start of the records returned,
            'limit':10 ## specifying the number of records returned
            }

## For government APIs, it's generally good practice to provide some sort of user agent!
user_agent = {'user-agent': 'University of Colorado at Boulder, natalie.castro@colorado.edu'}
''' TEST 2'''
test2_response = requests.get(base_url,url_post,headers=user_agent)
test2_response
<Response [200]>

2.2.1 Collecting Bill Numbers

To do so, I will be changing the URL for a few different parameters and scraping the congress.gov site. The filters applied are: legislation (with any status) and the Environmental Protection policy area.

The URL for the bill search (as of 1/28/2025) is: https://www.congress.gov/search?q=%7B%22congress%22%3A%22all%22%2C%22source%22%3A%22all%22%2C%22bill-status%22%3A%22all%22%2C%22subject%22%3A%22Environmental+Protection%22%7D

A CSV of bill numbers was downloaded, with a total of 8,056 bills. The original downloaded CSV comes with three 'metadata' lines; these consisted of the date collected and the URL, which I have listed here. The three lines were deleted so the files could be read in using Pandas.
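Alternatively, the metadata lines could be skipped at read time instead of deleted by hand; a sketch using pandas' skiprows argument, assuming the metadata occupies exactly the first three rows:

''' SKETCH: SKIPPING THE METADATA AT READ TIME '''
bill_information1 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_119_113.csv",
                                encoding='utf-8',
                                skiprows=3) ## pandas drops the first three lines before parsing the header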

bill_information1 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_119_113.csv",encoding='utf-8')
bill_information2 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_112_103.csv",encoding='utf-8')
bill_information3 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_102_95.csv",encoding='utf-8')
bill_information4 = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\epa_bills_94_93.csv",encoding='utf-8')

bill_information = pd.concat([bill_information1,bill_information2,bill_information3,bill_information4])
bill_information.head()
Legislation Number URL Congress Title Sponsor Date of Introduction Committees Latest Action Latest Action Date Number of Cosponsors Amends Bill Date Offered Date Submitted Date Proposed Amends Amendment
0 H.R. 375 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Continued Rapid Ohia Death Response Act of 2025 Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... 1/13/2025 House - Natural Resources, Agriculture | Senat... Received in the Senate and Read twice and refe... 1/24/2025 1 NaN NaN NaN NaN NaN
1 H.R. 349 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Goldie’s Act Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... 1/13/2025 House - Agriculture Referred to the House Committee on Agriculture. 1/13/2025 6 NaN NaN NaN NaN NaN
2 H.R. 313 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Natural Gas Tax Repeal Act Pfluger, August [Rep.-R-TX-11] (Introduced 01/... 1/9/2025 House - Energy and Commerce Referred to the House Committee on Energy and ... 1/9/2025 4 NaN NaN NaN NaN NaN
3 H.R. 288 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Long Island Sound Restoration and Stewardship ... LaLota, Nick [Rep.-R-NY-1] (Introduced 01/09/2... 1/9/2025 House - Transportation and Infrastructure, Nat... Referred to the Subcommittee on Water Resource... 1/10/2025 4 NaN NaN NaN NaN NaN
4 H.R. 284 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) GLRI Act of 2025 Joyce, David P. [Rep.-R-OH-14] (Introduced 01/... 1/9/2025 House - Transportation and Infrastructure Referred to the Subcommittee on Water Resource... 1/10/2025 28 NaN NaN NaN NaN NaN
''' CREATING CONGRESS TITLES '''
def congress_finder(current_congress): 
    ## looking for, but not including the th or st
    regex_pattern = '[0-9]{2,3}(?!=[a-z]{2})'
    congress_match = re.findall(regex_pattern,current_congress)
    
    congress_num = int(congress_match[0])
    
    return(congress_num)
bill_information['Congress Number'] = bill_information['Congress'].apply(lambda x: congress_finder(x))
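A quick spot check of the function against a congress title as the column stores them:

''' SPOT CHECK '''
print (congress_finder('119th Congress (2025-2026)')) ## expected output: 119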
''' CREATING A COLUMN FOR BILL TYPE:

The structure of all of the legislation numbers is [BILL TYPE]_[BILL NUMBER]. I will be using
Regex to extract and separate these columns
'''
''' BILL TYPE CLEANER: This function is used in a lambda apply down the rows to create the 
                        bill types needed for the congress.gov API

'''

bt_pattern = r'[A-Za-z]+\.*'
def bill_type_cleaner(bt):
    matches = re.findall(bt_pattern,bt)
    type_dirty = ''.join(matches)
    type_text = re.sub(r"\.","",type_dirty)
    type_clean = type_text.lower()
    return (type_clean)
bill_information['Bill Type'] = bill_information['Legislation Number'].apply(lambda x: bill_type_cleaner(x))
''' CREATING A COLUMN FOR BILL NUMBER '''
bt_num_pattern = r'[^A-Za-z\.]'
def bill_num_cleaner(bt):
    matches = re.findall(bt_num_pattern,bt)
    type_dirty = ''.join(matches)
    type_clean = type_dirty.lower().strip()
    return (int(type_clean)) ## The API asks for an integer
bill_information['Bill Number'] = bill_information['Legislation Number'].apply(lambda x: bill_num_cleaner(x))
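And a spot check that the two cleaners split a legislation number the way the congress.gov API expects:

''' SPOT CHECK '''
print (bill_type_cleaner('H.R. 375')) ## expected output: hr
print (bill_num_cleaner('H.R. 375')) ## expected output: 375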
''' CREATING A COLUMN FOR SPONSOR AFFILIATION & A COLUMN FOR SPONSOR STATE '''
affiliation_pattern = r'-[DRI]'
state_pattern = r'-[A-Z]{2}'
def affiliation_finder(sponsor):
    ## For party affiliation
    match = re.findall(affiliation_pattern,sponsor)
    clean_affiliation = re.sub("-","",match[0])
    
    return (clean_affiliation)

def state_finder(sponsor):
    
    ## For State affiliation
    state_match = re.findall(state_pattern,sponsor)
    clean_state = re.sub("-",'',state_match[0])
    return (clean_state)
bill_information['Sponsor Affiliation'] = bill_information['Sponsor'].apply(lambda x: affiliation_finder(x))
bill_information['Sponsor State'] = bill_information['Sponsor'].apply(lambda x: state_finder(x))
bill_information.head(2)
Legislation Number URL Congress Title Sponsor Date of Introduction Committees Latest Action Latest Action Date Number of Cosponsors Amends Bill Date Offered Date Submitted Date Proposed Amends Amendment Congress Number Bill Type Bill Number Sponsor Affiliation Sponsor State
0 H.R. 375 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Continued Rapid Ohia Death Response Act of 2025 Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... 1/13/2025 House - Natural Resources, Agriculture | Senat... Received in the Senate and Read twice and refe... 1/24/2025 1 NaN NaN NaN NaN NaN 119 hr 375 D HI
1 H.R. 349 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Goldie’s Act Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... 1/13/2025 House - Agriculture Referred to the House Committee on Agriculture. 1/13/2025 6 NaN NaN NaN NaN NaN 119 hr 349 R NY
bill_information.reset_index(inplace=True)
bill_information.drop(columns='index',inplace=True)

2.2.2 Collecting Bill Text URLs to Scrape from the Congress API

''' BUILDING A FUNCTION FOR THE URLS '''
def url_builder(congress,bill_type,bill_number):
    base_url = "https://api.congress.gov/v3/bill/"
    
    ## Assembling the text-versions endpoint for one bill, reusing the stored key
    ## rather than hardcoding it into the notebook
    request_url = base_url + str(congress) + "/" + bill_type + "/" + str(bill_number) + '/text?api_key=' + api_key
    
    return (request_url)
user_agent = {'user-agent': 'University of Colorado at Boulder, natalie.castro@colorado.edu'}

''' BUILDING A FUNCTION FOR THE API REQUESTS TO GATHER THE DATA '''
def xml_link_collector(url):
    ## Making a response with our URL
    response = requests.get(url,headers=user_agent)
    
    ## Making sure the response was valid
    try:
        out = response.json()
        collected_url = out['textVersions'][0]['formats'][2]['url'] ## This was determined through parsing the output for a test example
        return (collected_url)
    except:
        print (f"āš ļø uh oh! there was an error when using the API with this url:{url}\n ")
        return ("NO URL FOUND")
url = url_builder(119,'hjres',30)
xml_link_collector(url)
'https://www.congress.gov/119/bills/hjres30/BILLS-119hjres30ih.xml'
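One caveat: the hardcoded ['formats'][2] index assumes the XML link is always third in the list. A sketch of selecting it by label instead, assuming each entry in 'formats' is a dictionary with 'type' and 'url' keys:

''' SKETCH: PICKING THE XML FORMAT BY LABEL INSTEAD OF POSITION '''
def xml_link_by_type(url):
    response = requests.get(url,headers=user_agent)
    out = response.json()
    for fmt in out['textVersions'][0]['formats']:
        ## Keeping the first format whose type mentions XML
        if 'XML' in fmt.get('type',''):
            return (fmt['url'])
    return ("NO URL FOUND")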

Collecting the Bill Information:

This will take place in two parts, because the rate limit for the Congress API is 5,000 requests per hour.

''' SCRAPING XMLS USING THE API -- [PART 1]'''

url_list = []

for row in range(0,len(bill_information[0:4998])):
    congress = bill_information.at[row,'Congress Number']
    bill_type = bill_information.at[row,'Bill Type']
    bill_number = bill_information.at[row,'Bill Number']
    
    current_url = url_builder(congress,bill_type,bill_number)
    
    ## Making sure the API connection is well rested (ie - avoiding the rate limit)
    if row % 100 == 0:
        sleep (5)
    
    xml_found = xml_link_collector(current_url)
    
    url_list.append(xml_found)

The output from the above cell was removed because there were a lot of errors, and when converting to HTML I did not want to bog down the page! [screenshot of example errors omitted]

print (len(url_list))
4998
url_list[3249]
'https://www.congress.gov/109/bills/s2920/BILLS-109s2920is.xml'
url_list = url_list[:3250]
url_list[3249:3250]
['https://www.congress.gov/109/bills/s2920/BILLS-109s2920is.xml']
saving_urls = pd.DataFrame(url_list)
saving_urls.to_csv("Found URLs for Bills 0 - 3250.csv")
''' SCRAPING XMLS USING THE API -- [PART 2] '''

for row in range(3250,len(bill_information[3250:])): ## NOTE: this bound evaluates to range(3250, 4806), which stops short of the final rows -- see below
    congress = bill_information.at[row,'Congress Number']
    bill_type = bill_information.at[row,'Bill Type']
    bill_number = bill_information.at[row,'Bill Number']
    
    current_url = url_builder(congress,bill_type,bill_number)
    
    ## Making sure the API connection is well rested (ie - avoiding the rate limit)
    if row % 100 == 0:
        sleep (5)
    
    xml_found = xml_link_collector(current_url)
    
    url_list.append(xml_found)
## Testing that all went well...

print (f"The expected length of the URL list should be {len(bill_information)}.\nThe actual length of the URL list is {len(url_list)}.")
The expected length of the URL list should be 8056.
The actual length of the URL list is 4806.

Uh-oh! Two things are going on here. The loop bound above is buggy: range(3250,len(bill_information[3250:])) evaluates to range(3250, 4806), so the final rows were never requested. On top of that, many of the older bills do not have text available through the API.

To work with what was collected, I am going to deconstruct the URLs and then only get the information on those bills.

At this stage, I am filtering to look for what Bills do have available URLs

urls_raw = pd.DataFrame(url_list)
urls_raw.rename(columns={0:"API URL"},inplace=True) ## named to avoid clashing with the URL column already in the bill information
congress_pattern = r'(?<=/)[0-9]{2,3}'
bt_pattern = r'(?<=bills/)[a-z]+' ## the bill type sits directly after 'bills/' in the path
bill_num_pattern = r'(?<=[a-z])\d{1,4}(?=/BILLS)' ## the bill number ends directly before '/BILLS'
url_congress = []
url_bt = []
url_num = []

for row in range(0,len(urls_raw)):
    curr_url = urls_raw.at[row,'API URL']
    
    congress = re.findall(congress_pattern,curr_url)
    bill_type = re.findall(bt_pattern,curr_url)
    bill_num = re.findall(bill_num_pattern,curr_url)
    
    ## Only keeping rows where all three components were found; otherwise every
    ## column gets flagged so the three lists stay the same length
    if len(congress) > 0 and len(bill_type) > 0 and len(bill_num) > 0:
        url_congress.append(int(congress[0]))
        url_bt.append(bill_type[0])
        url_num.append(int(bill_num[0]))
    else:
        url_congress.append('DROP')
        url_bt.append('DROP')
        url_num.append('DROP')

urls_raw['Congress Number'] = url_congress
urls_raw['Bill Type'] = url_bt
urls_raw['Bill Number'] = url_num
## Cleaning the URLs
urls_raw.drop_duplicates(inplace=True)
## And dropping any row the regex parse flagged, combining the three
## conditions into one mask so the indices stay aligned
congress_condition = urls_raw['Congress Number'] != 'DROP'
bt_condition = urls_raw['Bill Type'] != 'DROP'
bn_condition = urls_raw['Bill Number'] != 'DROP'

urls_clean = urls_raw[congress_condition & bt_condition & bn_condition]
## Now merging back from the Bill Information

bill_information_final = pd.merge(left=urls_clean,right=bill_information, on=['Congress Number','Bill Type','Bill Number'],validate='1:1')
len(bill_information_final)
3262
## Now Saving the Supplemented DataFrame 
bill_information_final.to_csv("Bill Information Supplemented.csv")
url_list = bill_information_final['API URL'].to_list()
bill_information_final.head(2)
API URL Congress Number Bill Type Bill Number Legislation Number URL Congress Title Sponsor Date of Introduction ... Latest Action Latest Action Date Number of Cosponsors Amends Bill Date Offered Date Submitted Date Proposed Amends Amendment Sponsor Affiliation Sponsor State
0 https://www.congress.gov/119/bills/hr375/BILLS... 119 hr 375 H.R. 375 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Continued Rapid Ohia Death Response Act of 2025 Tokuda, Jill N. [Rep.-D-HI-2] (Introduced 01/1... 1/13/2025 ... Received in the Senate and Read twice and refe... 1/24/2025 1 NaN NaN NaN NaN NaN D HI
1 https://www.congress.gov/119/bills/hr349/BILLS... 119 hr 349 H.R. 349 https://www.congress.gov/bill/119th-congress/h... 119th Congress (2025-2026) Goldie’s Act Malliotakis, Nicole [Rep.-R-NY-11] (Introduced... 1/13/2025 ... Referred to the House Committee on Agriculture. 1/13/2025 6 NaN NaN NaN NaN NaN R NY

2 rows × 21 columns

2.2.2 šŸ•øļøScraping XML URLs

To get the text for each bill, the XML URL will be scraped and appended to the dataframe we are working with.

''' XML-PARSER:

In the XML output, I am interested in the Bill Title and Text. Although the title is already in the Bill Information,
I just want to make sure everything is correct! This parser takes an XML url as input, makes the request, parses
the response with Beautiful Soup, and returns the output as a title and text!

No preprocessing will occur at this stage, and the raw text will just be appended as a column to the Bill Information. 
Later on, this will make it easier to generate the labels.

'''

def xml_searcher(xml_url):
    xml_output = requests.get(xml_url)
    raw_xml = xml_output.text
    
    ## Using Beautiful Soup as a parser
    xml_soup = BeautifulSoup(raw_xml,'xml')
    
    ## Parsing for the title
    current_title = xml_soup.title
    
    ## Parsing for the text
    current_text = xml_soup.text
    
    return (current_title,current_text)
test_title, test_text = xml_searcher(url_list[400])
test_title
<dc:title>118 S2959 RS: Brownfields Reauthorization Act of 2023</dc:title>
test_text
'\n\n118 S2959 RS: Brownfields Reauthorization Act of 2023\nU.S. Senate\n2023-09-27\ntext/xml\nEN\nPursuant to Title 17 Section 105 of the United States Code, this file is not subject to copyright protection and is in the public domain.\n\n\n\nIICalendar No. 214118th CONGRESS1st SessionS. 2959IN THE SENATE OF THE UNITED STATESSeptember 27 (legislative day, September 22), 2023Mr. Carper, from the Committee on Environment and Public Works, reported the following original bill; which was read twice and placed on the calendarA BILLTo amend the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 to reauthorize brownfields revitalization funding, and for other purposes.1.Short titleThis Act may be cited as the Brownfields Reauthorization Act of 2023.2.Improving small and disadvantaged community access to grant opportunitiesSection 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)) is amended—(1)in paragraph (1)(I), by inserting or 501(c)(6) after section 501(c)(3);(2)in paragraph (5), by striking subparagraph (E);(3)in paragraph (6)(C), by striking clause (ix) and inserting the following:(ix)The extent to which the applicant has a plan—(I)to engage a diverse set of local groups and organizations that effectively represent the views of the local community that will be directly affected by the proposed brownfield project; and(II)to meaningfully involve the local community described in subclause (I) in making decisions relating to the proposed brownfield project.;(4)in paragraph (10)(B)(iii)—(A)by striking 20 percent and inserting 10 percent;(B)by inserting the eligible entity is located in a small community or disadvantaged area (as those terms are defined in section 128(a)(1)(B)(iv)) or after unless; and(C)by inserting , in which case the Administrator shall waive the matching share requirement under this clause before ; and; and(5)in paragraph (13), by striking 2019 through 2023 and inserting 2024 through 2029.3.Increasing grant amountsSection 104(k)(3)(A)(ii) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)(3)(A)(ii)) is amended by striking $500,000 and all that follows through the period at the end and inserting $1,000,000 for each site to be remediated..4.State response programsSection 128(a) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9628(a)) is amended—(1)in paragraph (1)(B)(i), by striking or enhance and inserting , enhance, or implement; and(2)by striking paragraph (3) and inserting the following:(3)Authorization of appropriationsThere are authorized to be appropriated to carry out this subsection—(A)$50,000,000 for fiscal year 2024;(B)$55,000,000 for fiscal year 2025;(C)$60,000,000 for fiscal year 2026;(D)$65,000,000 for fiscal year 2027;(E)$70,000,000 for fiscal year 2028; and(F)$75,000,000 for fiscal year 2029..5.Report to identify opportunities to streamline application process; updating guidance(a)ReportNot later than 1 year after the date of enactment of this Act, the Administrator of the Environmental Protection Agency (referred to in this section as the Administrator) shall submit to Congress a report that evaluates the application ranking criteria and approval process for grants and loans under section 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 
9604(k)), which shall include, with respect to those grants and loans—(1)an evaluation of the shortcomings in the existing application requirements that are a recurring source of confusion for potential recipients of those grants or loans;(2)an identification of the most common sources of point deductions on application reviews;(3)strategies to incentivize the submission of applications from small communities and disadvantaged areas (as those terms are defined in section 128(a)(1)(B)(iv) of that Act (42 U.S.C. 9628(a)(1)(B)(iv)); and(4)recommendations, if any, to Congress on suggested legislative changes to the ranking criteria that would achieve the goal of streamlining the application process for small communities and disadvantaged areas (as so defined).(b)Updating guidanceNot later than 1 year after the date of enactment of this Act, the Administrator shall update the guidance relating to the application ranking criteria and approval process for grants and loans under section 104(k) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)) to reduce the complexity of the application process while ensuring competitive integrity.6.Brownfield revitalization funding for Alaska Native tribesSection 104(k)(1) of the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (42 U.S.C. 9604(k)(1)) is amended—(1)in subparagraph (G), by striking other than in Alaska; and(2)by striking subparagraph (H) and inserting the following:(H)a Regional Corporation or a Village Corporation (as those terms are defined in section 3 of the Alaska Native Claims Settlement Act (43 U.S.C. 1602));.September 27 (legislative day, September 22), 2023Read twice and placed on the calendar'
''' APPLYING THE XML PARSER '''
title_list = []
text_list = []

for url in url_list:
    try:
        title, text = xml_searcher(url)

        title_list.append(title)
        text_list.append(text)
    except:
        title_list.append("ERROR")
        text_list.append("ERROR")
## Checking that everything went smoothly
print (len(title_list))
print (len(text_list))
3262
3262
bill_information_final['Bill Title (XML)'] = title_list
bill_information_final['Bill Text'] = text_list
## Saving just in case :) -- note this saves the supplemented frame, not the original bill_information
bill_information_final.to_csv("Bill Information Supplemented.csv")

šŸ‘·šŸ»ā€ā™€ļø 3. Data Structuring

At this stage, the data will not be cleaned; that will happen in a subsequent notebook. The data is structured using the provided query from the API or webscrape and is not altered.

3.1 News API

3.1.1 NewsAPI: Everything

''' DATAFRAME GENERATION - DEMOCRAT '''
concatenate_democrats = []
for dataset in nested_responses_democrat:
    current_dataframe = pd.DataFrame(dataset)
    concatenate_democrats.append(current_dataframe)
democrat_articles = pd.concat(concatenate_democrats)
print (len(democrat_articles))
444
democrat_articles.head(2)
source author title description url urlToImage publishedAt content
0 {'id': None, 'name': 'CNET'} Katie Collins For Progress on Climate and Energy in 2025, Th... As Trump and his anti-science agenda head for ... https://www.cnet.com/home/energy-and-utilities... https://www.cnet.com/a/img/resize/5fa89cffd3d5... 2025-01-03T13:00:00Z With its sprawling canopy of magnolia, dogwood...
1 {'id': 'time', 'name': 'Time'} Will Weissert and Chris Megerian / AP Trump to Visit Disaster-Stricken California an... President Trump is heading to hurricane-batter... https://time.com/7209700/trump-los-angeles-wil... https://api.time.com/wp-content/uploads/2025/0... 2025-01-24T06:30:00Z WASHINGTON President Donald Trump is heading t...
democrat_articles.describe()
source author title description url urlToImage publishedAt content
count 444 402 444 443 444 319 444 444
unique 119 285 421 429 444 315 424 441
top {'id': None, 'name': 'Freerepublic.com'} Breitbart Democrat Sen. Markey: L.A. Fires Are 'Climate ... Democrat Massachusetts Sen. Ed Markey has clai... https://www.cnet.com/home/energy-and-utilities... https://static.dw.com/image/71400751_6.jpg 2025-01-25T05:00:00Z In a confirmation hearing on Thursday, Democra...
freq 119 14 4 4 1 3 3 2
''' DATAFRAME GENERATION - REPUBLICAN '''
concatenate_republicans = []
for dataset in nested_responses_republican:
    current_dataframe = pd.DataFrame(dataset)
    concatenate_republicans.append(current_dataframe)
republican_articles = pd.concat(concatenate_republicans)
republican_articles.head(2)
source author title description url urlToImage publishedAt content
0 {'id': 'the-verge', 'name': 'The Verge'} Nilay Patel Trump’s first 100 days: all the news impacting... President Donald Trump is taking on TikTok, el... https://www.theverge.com/24348851/donald-trump... https://cdn.vox-cdn.com/thumbor/Nwo4_i4giY8lRM... 2025-01-22T14:30:00Z Filed under:\r\nByLauren Feiner, a senior poli...
1 {'id': None, 'name': 'Gizmodo.com'} Kate Yoder, Grist The Quiet Death of Biden’s Climate Corps—and W... Biden's green jobs program was never what it s... https://gizmodo.com/the-quiet-death-of-bidens-... https://gizmodo.com/app/uploads/2025/01/Americ... 2025-01-18T15:00:26Z Giorgio Zampaglione loved his two-hour commute...
republican_articles.describe()
source author title description url urlToImage publishedAt content
count 376 350 376 376 376 373 376 376
unique 93 267 375 365 376 369 371 370
top {'id': None, 'name': 'Forbes'} ABC News US to withdraw from Paris agreement, expand dr... Organizations including Walmart, Lowe’s and Me... https://www.theverge.com/24348851/donald-trump... https://imageio.forbes.com/specials-images/ima... 2025-01-16T10:00:00Z <ul><li>Trump endorses House Speaker Mike John...
freq 29 18 2 3 1 3 2 4
''' DATA STORAGE

This is saving all of the raw data (minimal structuring) to their respective CSV files.

'''

## Everything endpoint
republican_articles.to_csv("NEWSAPI - republican climate articles raw.csv")
democrat_articles.to_csv("NEWSAPI - democrat climate articles raw.csv")

3.1.2 NewsAPI: Sources

The source for each article will be extracted along with its author. This is to better understand who is representing and providing narrative for each party.

democrat_sources = democrat_articles[['source','author']].copy() ## copying so the edits below do not write to a view of the original frame
republican_sources = republican_articles[['source','author']].copy()
len(democrat_sources)
444
democrat_sources.head(2)
source author
0 {'id': None, 'name': 'CNET'} Katie Collins
1 {'id': 'time', 'name': 'Time'} Will Weissert and Chris Megerian / AP
republican_sources.head(2)
source author
0 {'id': 'the-verge', 'name': 'The Verge'} Nilay Patel
1 {'id': None, 'name': 'Gizmodo.com'} Kate Yoder, Grist
''' SOURCE CLEANER:

    INPUT: a list of sources in a dictionary structure with the key 'name'
    OUTPUT: two lists: the first contains the entire list of sources 
            cleaned, and the second is only the unique sources
            
    The purpose of this function is to clean the results from the everything
    endpoint from the NewsAPI. The result will be a list (which can be appended
    to a new dataframe) to match the authors of clean sources. The second
    return of the function is a unique list of sources

'''


def source_cleaner(source_list):
    ## Creating a storage container for the cleaned sources
    cleaned_sources = []
    
    ## Iterating through each source in the provided list
    for source in source_list:
        ## Obtaining the name + storing it
        current_source = source['name']
        
        cleaned_sources.append(current_source)
    
    ## Finding the Unique Sources from the list
    unique_sources = list(set(cleaned_sources))
    
    return (cleaned_sources,unique_sources)
dem_sources_full = democrat_sources['source'].to_list().copy()
rep_sources_full = republican_sources['source'].to_list().copy()
dem_cleaned_sources,dem_unique_sources = source_cleaner(dem_sources_full)
rep_cleaned_sources,rep_unique_sources = source_cleaner(rep_sources_full)
democrat_sources['source'] = dem_cleaned_sources
republican_sources['source'] = rep_cleaned_sources
## Broadcasting a scalar assigns the label to every row, no loop needed
democrat_sources['Party'] = 'Democrat'
republican_sources['Party'] = 'Republican'
democrat_sources.fillna('No Author',inplace=True)
republican_sources.fillna('No Author',inplace=True)
democrat_sources
source author Party
0 CNET Katie Collins Democrat
1 Time Will Weissert and Chris Megerian / AP Democrat
2 Politicopro.com Blanca Begert, Camille von Kaenel, Thomas Fran... Democrat
3 Scientific American Tanya Lewis Democrat
4 Vox Benji Jones Democrat
... ... ... ...
95 PBS Will Weissert, Associated Press, Amelia Thomso... Democrat
96 PBS Michelle L. Price, Associated Press Democrat
97 PBS Bernard McGhee, Associated Press Democrat
98 The Times of India Navtej Sarna Democrat
99 The Times of India AP Democrat

444 rows × 3 columns

republican_sources
source author Party
0 The Verge Nilay Patel Republican
1 Gizmodo.com Kate Yoder, Grist Republican
2 BBC News No Author Republican
3 BBC News No Author Republican
4 BBC News No Author Republican
... ... ... ...
75 MSNBC Jasen Castillo, John Schuessler, Miranda Priebe Republican
76 MSNBC Jen Psaki Republican
77 Themorningnews.org The Morning News Republican
78 Finextra Editorial Team Republican
79 Japan Today No Author Republican

376 rows × 3 columns

all_sources = pd.concat([democrat_sources,republican_sources])
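For consistency with the other raw pulls, the combined table can be written out as well; the filename below is illustrative:

## Saving the combined source/author table alongside the other raw CSVs
all_sources.to_csv("NEWSAPI - combined sources raw.csv")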

4. Supplementary Media

The supplementary media collected in this section consists of two PDF files, one from each of the respective parties. The party platform is the document each party produces to proclaim its goals and resolutions should it take office.

dem_pdf= pypdf.PdfReader(r"C:\Users\natal\OneDrive\university\info 5653\data\2024_democratic_party_platform.pdf",strict=True)
rep_pdf = pypdf.PdfReader(r"C:\Users\natal\OneDrive\university\info 5653\data\2024_republican_party_platform.pdf")
''' EXPLORING THE METADATA'''
dem_pdf.metadata
{'/Title': 'FINAL MASTER PLATFORM',
 '/Producer': 'Skia/PDF m129 Google Docs Renderer'}
dem_pdf.decode_permissions
<bound method PdfReader.decode_permissions of <pypdf._reader.PdfReader object at 0x00000206AB5438B0>>
len(dem_pdf.pages)
92
rep_pdf.metadata
{'/CreationDate': "D:20240710083033-05'00'",
 '/Creator': 'Adobe InDesign 19.4 (Macintosh)',
 '/ModDate': "D:20240710083036-05'00'",
 '/Producer': 'Adobe PDF Library 17.0',
 '/Trapped': '/False'}
len(rep_pdf.pages)
28

4.1 Extracting Text

''' A SIMPLE TEXT EXTRACTION '''
page1 = rep_pdf.pages[0]
print(page1.extract_text())
4343RDRD REPUBLICAN NATIONAL CONVENTION REPUBLICAN NATIONAL CONVENTION
PLATFORMTHE 2024 REPUBLICAN
MAKE AMERICA GREAT AGAIN!
''' EXTRACTING TEXT FROM DEMOCRAT DOCUMENT'''
democrat_party_platform = []

for page in range(0,len(dem_pdf.pages)):
    current_page = dem_pdf.pages[page]
    current_text = current_page.extract_text()
    democrat_party_platform.append(current_text)
''' EXTRACTING TEXT FROM REPUBLICAN DOCUMENT'''
republican_party_platform = []

for page in range(0,len(rep_pdf.pages)):
    current_page = rep_pdf.pages[page]
    current_text = current_page.extract_text()
    republican_party_platform.append(current_text)
''' COMBINING INTO ONE TEXT PER PARTY '''
democrat_party_platform_all = ' '.join(democrat_party_platform)
republican_party_platform_all = ' '.join(republican_party_platform)
''' SAVING THE RAW TEXTS AS .TXTS'''
with open("democrat_party_platform.txt", "w",errors='replace') as file:
    file.write(democrat_party_platform_all)
with open("republican_party_platform.txt", "w",errors='replace') as file:
    file.write(republican_party_platform_all)