Political Stances - Creating Basket Data
2. Creating Basket Data + DTM Data
The purpose of this notebook is to generate basket data, which is well suited to clustering and association rule mining. In the desired table, each row is a document and the columns hold the individual words. The output is saved in CSV format.
The data used in this notebook come from the CountVectorizer output. Starting from the counts means each value can be reduced to binary (a word is either present or absent), after which each row can be cleaned down to only the words it contains.
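As a small illustration of that conversion, the sketch below uses a made-up count row (the words and counts are hypothetical, not taken from the project data) and shows the binary version next to the resulting basket.

```python
# Hypothetical CountVectorizer-style row: word -> number of occurrences.
count_row = {"policy": 3, "tax": 0, "vote": 1, "border": 0}

# Binary (presence/absence) version of the same row.
binary_row = {word: int(count > 0) for word, count in count_row.items()}

# Basket version: keep only the words that actually appear.
basket = [word for word, present in binary_row.items() if present]

print(binary_row)
print(basket)
```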
1. Environment Creation
1.1 Library Import
import pandas as pd
import os
from tqdm.notebook import tqdm
import csv
1.2 Data Import
bill_information = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Bills Lemmed- Count Vectorizer.csv")
bill_information.head(2)
| | Bill Type | Sponser Affiliation | Sponser State | Committees | aa | aaa | aarhu | ab | abandon | ... |
---|---|---|---|---|---|---|---|---|---|---|
0 | hr | D | HI | House - Natural Resources, Agriculture \| Senat... | 0 | 0 | 0 | 0 | 0 | ... |
1 | hr | R | NY | House - Agriculture | 0 | 0 | 0 | 0 | 0 | ... |
2 rows × 15494 columns
news = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\News Articles Lemmed- Count Vectorizer.csv")
news.head(2)
| | Party | publisher | aapi | abandon | abandoned | abc | ability | able | abolish | ... |
---|---|---|---|---|---|---|---|---|---|---|
0 | Republican | The Verge | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
1 | Republican | Gizmodo.com | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
2 rows × 2361 columns
party_platform = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Party Platform Lemmed- Count Vectorizer.csv")
party_platform.head(2)
| | Party | ability | able | abortion | access | accessible | according | accountability | accountable | ... |
---|---|---|---|---|---|---|---|---|---|---|
0 | Republican | 1 | 1 | 1 | 4 | 1 | 1 | 1 | 4 | ... |
1 | Democrat | 7 | 13 | 13 | 72 | 15 | 1 | 6 | 14 | ... |
2 rows × 894 columns
1.3 Light Data Cleaning
The data used in this notebook have already undergone extensive cleaning, which may be viewed on my project page. In this section, the label columns and the unnamed index column are dropped, along with a set of leftover HTML/CSS tokens (font names and layout keywords) in the bills data.
bill_information.drop(columns=['Unnamed: 0','Sponser State','Bill Type','Sponser Affiliation','Committees'],inplace=True)
bill_information.drop(columns=['helvetica','noto','sego','neue','vh','html','webkit','emoji','blinkmacsystemfont','arial','roboto','ui','serif',
'column','font','pad','width','auto','left','height'],inplace=True)
news.drop(columns=['Unnamed: 0','Party','publisher'],inplace=True)
party_platform.drop(columns=['Unnamed: 0','Party'],inplace=True)
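The drops above modify each table in place, so running the cell a second time raises a `KeyError` because the columns are already gone. One possible safeguard, sketched here on a hypothetical toy frame rather than the project data, is the `errors='ignore'` option of `DataFrame.drop`.

```python
import pandas as pd

# Toy frame with a label column and an index column left over from a CSV export.
toy = pd.DataFrame({"Unnamed: 0": [0, 1], "Party": ["R", "D"], "tax": [1, 0]})

# errors='ignore' skips labels that are absent ('publisher' does not exist here).
words_only = toy.drop(columns=["Unnamed: 0", "Party", "publisher"], errors="ignore")

# Repeating the drop is harmless because the missing labels are ignored.
words_only = words_only.drop(columns=["Unnamed: 0", "Party"], errors="ignore")
print(list(words_only.columns))
```

Returning a new frame instead of using `inplace=True` also leaves the original table untouched, which makes the cell safe to re-run.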
2. Creating Basket Data
2.1 Assembling the Transactions
def transaction_creator(index, basket):
    # Keep only the words whose count in this document is non-zero
    transaction_dictionary = {key: val for key, val in basket[index].items() if val != 0.0}
    items = list(transaction_dictionary.keys())
    return items
news_basket = news.to_dict(orient='records')
bills_basket = bill_information.to_dict(orient='records')
party_basket = party_platform.to_dict(orient='records')
baskets = [news_basket,bills_basket,party_basket]
transactions = []
for current_basket in baskets:
    for index in tqdm(range(0,len(current_basket)),desc='🛒🐛... | inching through the store'):
        transactions.append(transaction_creator(index,current_basket))
🛒🐛... | inching through the store: 0%| | 0/820 [00:00<?, ?it/s]
🛒🐛... | inching through the store: 0%| | 0/3261 [00:00<?, ?it/s]
🛒🐛... | inching through the store: 0%| | 0/2 [00:00<?, ?it/s]
## And checking to make sure it worked...
transactions[0]
['ai',
'backing',
'billionaire',
'day',
'donald',
'electric',
'ha',
'impacting',
'industry',
'news',
'office',
'policy',
'president',
'taking',
'tech',
'tiktok',
'time',
'trump',
'vehicle']
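The same transactions can also be built without passing an index around. The sketch below is one possible alternative, not the notebook's method; `records_to_transactions` and the toy records are hypothetical, shaped like the output of `to_dict(orient='records')`.

```python
def records_to_transactions(records):
    """Return one list of words per record, keeping only non-zero counts."""
    return [
        [word for word, count in record.items() if count != 0]
        for record in records
    ]

# Toy records standing in for news.to_dict(orient='records').
toy_records = [
    {"ai": 1.0, "tax": 0.0, "trump": 2.0},
    {"ai": 0.0, "tax": 1.0, "trump": 0.0},
]

toy_transactions = records_to_transactions(toy_records)
print(toy_transactions)
```

A record whose counts are all zero produces an empty list, so it may be worth checking for empty transactions before mining rules.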
2.2 Saving the Basket Data
''' WRITING TO A CSV '''
# newline='' lets the csv module control line endings, as its documentation recommends
with open('Basket Data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(transactions)
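Because each transaction has a different number of words, the saved file is a ragged CSV: its rows do not share a common width. The sketch below round-trips two made-up transactions through an in-memory buffer (used here only to avoid touching the disk) to show that `csv.reader` returns each row with its own length.

```python
import csv
import io

# Two made-up transactions of different lengths.
sample_transactions = [["ai", "policy", "trump"], ["tax"]]

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(sample_transactions)

# Read the rows back; each row keeps its own length.
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored)
```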
3. Creating Document Term Matrix Data
dtm_df = pd.concat([bill_information,news, party_platform])
dtm_df.head()
| | aa | aaa | aarhu | ab | abandon | abandonth | abat | abbrevi | abercrombi | abey | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
5 rows × 17198 columns
dtm_df = dtm_df.fillna(0)
dtm_df.reset_index(inplace=True,drop=True)
dtm_df.head(1)
| | aa | aaa | aarhu | ab | abandon | abandonth | abat | abbrevi | abercrombi | abey | ... |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... |
1 rows × 17198 columns
dtm_df.to_csv("Document Term Matrix.csv")
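Stacking tables with different vocabularies is what makes the column count grow and the counts turn into floats. The toy frames below are hypothetical, and the sketch relies on the default behaviour of `pd.concat`, which keeps the union of the columns.

```python
import pandas as pd

# Two toy document-term tables with partly different vocabularies.
left = pd.DataFrame({"ai": [1], "tax": [2]})
right = pd.DataFrame({"tax": [3], "vote": [4]})

# Default concat keeps the union of columns and leaves NaN where a word is missing.
stacked = pd.concat([left, right])

# Replace the gaps with zero counts and renumber the rows from 0.
toy_dtm = stacked.fillna(0).reset_index(drop=True)
print(toy_dtm)
```

Columns that contained gaps come out as floats, which is consistent with the `0.0` values displayed in the output above.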