2. Creating Basket Data + DTM Data

The purpose of this notebook is to generate basket data, which is well suited to clustering and association rule mining. In the desired table, each row is a document and the columns are the individual words. The output is saved in CSV format.

The data used in this notebook come from the CountVectorizer output. Starting from counts means each value can be reduced to a binary present/absent indicator, after which each row can be cleaned down to just the words that appear in that document.
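As a minimal sketch of that idea, the toy example below turns a small count matrix into binary indicators and then into one word list per document. The `counts` frame and its words are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy counts standing in for the CountVectorizer output.
counts = pd.DataFrame(
    {"energy": [2, 0, 1], "tax": [0, 3, 0], "vote": [1, 1, 0]},
    index=["doc_0", "doc_1", "doc_2"],
)

# Any count above zero becomes 1: a word is either present or absent.
binary = (counts > 0).astype(int)

# Each basket lists the words present in one document.
toy_baskets = [
    list(binary.columns[row.to_numpy() == 1]) for _, row in binary.iterrows()
]
print(toy_baskets)  # [['energy', 'vote'], ['tax', 'vote'], ['energy']]
```

Note that the baskets have different lengths, which is why the saved file later in this notebook is not a rectangular table.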

1. Environment Creation

1.1 Library Import

import pandas as pd
import os
from tqdm.notebook import tqdm
import csv

1.2 Data Import

bill_information = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Bills Lemmed- Count Vectorizer.csv")
bill_information.head(2)
Bill Type Sponser Affiliation Sponser State Committees aa aaa aarhu ab abandon ...
0 hr D HI House - Natural Resources, Agriculture | Senat... 0 0 0 0 0 ...
1 hr R NY House - Agriculture 0 0 0 0 0 ...

2 rows × 15494 columns

news = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\News Articles Lemmed- Count Vectorizer.csv")
news.head(2)
Party publisher aapi abandon abandoned abc ability able abolish
0 Republican The Verge 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
1 Republican Gizmodo.com 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

2 rows × 2361 columns

party_platform = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Party Platform Lemmed- Count Vectorizer.csv")
party_platform.head(2)
Party ability able abortion access accessible according accountability accountable ...
0 Republican 1 1 1 4 1 1 1 4 ...
1 Democrat 7 13 13 72 15 1 6 14 ...

2 rows × 894 columns

1.2.1 Light Data Cleaning

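The data used in this notebook have already undergone extensive cleaning, which may be viewed on my project page. In this section, the label columns and the unnamed index column are dropped. For the bills, a set of tokens that appear to be leftover HTML/CSS styling terms (font names and layout keywords) is dropped as well.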

bill_information.drop(columns=['Unnamed: 0','Sponser State','Bill Type','Sponser Affiliation','Committees'],inplace=True)
bill_information.drop(columns=['helvetica','noto','sego','neue','vh','html','webkit','emoji','blinkmacsystemfont','arial','roboto','ui','serif',
                              'column','font','pad','width','auto','left','height'],inplace=True)
news.drop(columns=['Unnamed: 0','Party','publisher'],inplace=True)
party_platform.drop(columns=['Unnamed: 0','Party'],inplace=True)

2. Creating Basket Data

2.1 Assembling the Transactions

def transaction_creator(index, basket):
    # Keep only the words whose count in this document is nonzero
    transaction_dictionary = {key: val for key, val in basket[index].items() if val != 0.0}
    return list(transaction_dictionary.keys())

# One dictionary per document, mapping each word to its count
news_basket = news.to_dict(orient='records')
bills_basket = bill_information.to_dict(orient='records')
party_basket = party_platform.to_dict(orient='records')
baskets = [news_basket, bills_basket, party_basket]
transactions = []

# Build one transaction (list of words) per document, across all three sources
for current_basket in baskets:
    for index in tqdm(range(len(current_basket)), desc='🛒🐛... | inching through the store'):
        transactions.append(transaction_creator(index, current_basket))
        
🛒🐛... | inching through the store:   0%|          | 0/820 [00:00<?, ?it/s]
🛒🐛... | inching through the store:   0%|          | 0/3261 [00:00<?, ?it/s]
🛒🐛... | inching through the store:   0%|          | 0/2 [00:00<?, ?it/s]

## And checking to make sure it worked...
transactions[0]
['ai',
 'backing',
 'billionaire',
 'day',
 'donald',
 'electric',
 'ha',
 'impacting',
 'industry',
 'news',
 'office',
 'policy',
 'president',
 'taking',
 'tech',
 'tiktok',
 'time',
 'trump',
 'vehicle']
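
The same transactions could also be built directly from a DataFrame, without the intermediate list of dictionaries. The sketch below is one possible alternative, not the method used in this notebook; `rows_to_transactions` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def rows_to_transactions(frame):
    """Return one list of column names per row, keeping the nonzero columns."""
    columns = frame.columns.to_numpy()
    # Assumption: missing values count as "word absent". The dictionary-based
    # version above would keep them, because NaN != 0.0 evaluates to True.
    mask = frame.fillna(0).to_numpy() != 0
    return [columns[row].tolist() for row in mask]

# Hypothetical toy data: two documents, three words.
toy = pd.DataFrame({"ai": [1.0, 0.0], "tech": [2.0, 0.0], "trump": [0.0, 1.0]})
toy_transactions = rows_to_transactions(toy)
print(toy_transactions)  # [['ai', 'tech'], ['trump']]
```

On wide tables such as the bills data, a boolean mask over the whole array avoids creating one dictionary per document, though the loop above is easier to follow step by step.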

2.2 Saving the Basket Data

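# Writing to a CSV: each row holds one document's words, so row lengths vary

# newline='' is the setting the csv module documentation recommends for files passed to csv.writer
with open('Basket Data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(transactions)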
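
Because every transaction has a different number of words, the saved file is ragged rather than rectangular, and `pd.read_csv` can raise a parser error when a later row has more fields than the first. A minimal sketch of reading such a file back with `csv.reader` is shown below; it uses a temporary file and made-up transactions rather than the real `Basket Data.csv`.

```python
import csv
import os
import tempfile

# Hypothetical transactions with different lengths.
toy_transactions = [["ai", "tech", "trump"], ["tax"], ["vote", "energy"]]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_basket.csv")

    # Write one transaction per row, as in the cell above.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(toy_transactions)

    # csv.reader returns each row as its own list, whatever its length.
    with open(path, newline="", encoding="utf-8") as handle:
        restored = [row for row in csv.reader(handle)]

print(restored == toy_transactions)  # True
```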

3. Creating Document Term Matrix Data

dtm_df = pd.concat([bill_information,news, party_platform])
dtm_df.head()
aa aaa aarhu ab abandon abandonth abat abbrevi abercrombi abey ...
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

5 rows × 17198 columns

dtm_df = dtm_df.fillna(0)
dtm_df.reset_index(inplace=True,drop=True)
dtm_df.head(1)
aa aaa aarhu ab abandon abandonth abat abbrevi abercrombi abey ...
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

1 rows × 17198 columns
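
The three sources do not share a vocabulary, so `pd.concat` aligns columns by name and leaves `NaN` wherever a word never occurred in a source; `fillna(0)` then records those as zero counts. The toy frames below are hypothetical and only illustrate that behavior.

```python
import pandas as pd

# Hypothetical mini vocabularies that only partly overlap.
bills_toy = pd.DataFrame({"abandon": [1, 0], "bill": [2, 3]})
news_toy = pd.DataFrame({"abandon": [0.0], "tiktok": [4.0]})

combined = pd.concat([bills_toy, news_toy])
# At this point 'tiktok' is NaN for the bills rows and 'bill' is NaN for the news row.

# Treat a missing word as a zero count and renumber the rows from 0.
combined = combined.fillna(0).reset_index(drop=True)
print(combined)
```

The result has three rows and the union of the columns (`abandon`, `bill`, `tiktok`). The introduction of `NaN` during alignment is also why the combined counts display as floats such as `0.0`.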

dtm_df.to_csv("Document Term Matrix.csv")
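
By default `to_csv` also writes the row index as an unlabeled first column, which comes back as `Unnamed: 0` when the file is read without further arguments (the same column name dropped in the cleaning step earlier). The sketch below shows two ways to handle it on a hypothetical toy frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Hypothetical two-document matrix.
toy = pd.DataFrame({"aa": [0.0, 1.0], "ab": [2.0, 0.0]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unlabeled first column

buffer.seek(0)
default_read = pd.read_csv(buffer)
print(list(default_read.columns))  # ['Unnamed: 0', 'aa', 'ab']

buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)
print(list(indexed_read.columns))  # ['aa', 'ab']
```

Passing `index=False` to `to_csv` would leave the index out of the file entirely; whether that is preferable depends on whether later steps rely on the document number.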