2. Creating Basket Data + DTM Data

The purpose of this notebook is to generate basket data, which is well suited to clustering and association rule mining. In the desired table, each row is a document and each column is an individual word. The final output is written to a CSV file.

The data used in this notebook come from the CountVectorizer output. Starting from the counts allows each value to be converted to binary (a word is either present in a document or it is not), after which each row can be cleaned.
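As a minimal sketch of the count-to-binary idea described above, the snippet below uses a small made-up count table. The frame `counts` and its values are hypothetical and are not taken from the project files. It assumes that any positive count should become 1, and it shows one possible way to read each row as a basket of words.

```python
import pandas as pd

# Toy document-term counts (hypothetical data, not the project's CSV files)
counts = pd.DataFrame(
    {"access": [4, 0, 72], "abortion": [1, 0, 13], "able": [0, 2, 0]}
)

# Any positive count becomes 1; zeros (and missing values) become 0
binary = (counts > 0).astype(int)

# One basket per document: the words whose indicator is 1
baskets = [
    list(binary.columns[row.to_numpy() == 1])
    for _, row in binary.iterrows()
]
```

The comparison `counts > 0` also works when the counts are stored as floats, as they appear to be in the news data.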

1. Environment Creation

1.1 Library Import

import pandas as pd
import os
from tqdm.notebook import tqdm
import csv

1.2 Data Import

bill_information = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Bills Lemmed- Count Vectorizer.csv")

bill_information.head(2)

	Bill Type	Sponser Affiliation	Sponser State	Committees	aa	aaa	aarhu	ab	abandon	...
0	hr	D	HI	House - Natural Resources, Agriculture \| Senat...	0	0	0	0	0	...
1	hr	R	NY	House - Agriculture	0	0	0	0	0	...

2 rows × 15494 columns

news = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\News Articles Lemmed- Count Vectorizer.csv")

news.head(2)

	Party	publisher	aapi	abandon	abandoned	abc	ability	able	abolish	...
0	Republican	The Verge	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...
1	Republican	Gizmodo.com	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...

2 rows × 2361 columns

party_platform = pd.read_csv(r"C:\Users\natal\OneDrive\university\info 5653\data\Party Platform Lemmed- Count Vectorizer.csv")

party_platform.head(2)

	Party	ability	able	abortion	access	accessible	according	accountability	accountable	...
0	Republican	1	1	1	4	1	1	1	4	...
1	Democrat	7	13	13	72	15	1	6	14	...

2 rows × 894 columns

1.2.2 Light Data Cleaning

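The data used in this notebook have already been cleaned extensively; that process is documented on my project page. In this section, the label columns and the unnamed index column are dropped so that only word columns remain. For the bills, a set of font and CSS tokens (such as 'helvetica' and 'webkit') is also removed from the vocabulary.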

bill_information.drop(columns=['Unnamed: 0','Sponser State','Bill Type','Sponser Affiliation','Committees'],inplace=True)

# Drop font and CSS tokens (web-formatting residue) left in the bill vocabulary
bill_information.drop(columns=['helvetica','noto','sego','neue','vh','html','webkit','emoji','blinkmacsystemfont','arial','roboto','ui','serif',
                              'column','font','pad','width','auto','left','height'],inplace=True)

news.drop(columns=['Unnamed: 0','Party','publisher'],inplace=True)

party_platform.drop(columns=['Unnamed: 0','Party'],inplace=True)
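One optional way to confirm that the drops above removed every label is to list any columns that are still non-numeric. The sketch below is an assumption about how such a check could look, not part of the original workflow. The helper `non_numeric_columns` and the toy frame `example` are hypothetical names introduced for illustration.

```python
import pandas as pd

def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()

# Toy frame standing in for a vectorizer export (hypothetical data)
example = pd.DataFrame(
    {"Party": ["Republican", "Democrat"], "ability": [1, 7], "able": [1, 13]}
)

leftover = non_numeric_columns(example)   # label columns still present
cleaned = example.drop(columns=leftover)  # only word-count columns remain
```

Running `non_numeric_columns(bill_information)` after the cleaning cells should return an empty list if only word-count columns are left.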