Most common words for each sector, visualized


Sun Aug 22 2021 18:21:27 GMT+0000 (Coordinated Universal Time)

Saved by @QuinnFox12 #chinese #visualization

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist

# Sectors of activity present in the training data
type_of_companies = set(data_train['label'])

# Stop words to filter out of the descriptions
# (requires the NLTK stopwords corpus: nltk.download('stopwords'))
stop = stopwords.words('english')
stop.append("the")
stop.append("company")
stop_words = set(stop)

# Map each sector to a numeric index, and back
label_company = dict()
label_other = dict()
for index, s in enumerate(type_of_companies):
    label_company[index] = s
    label_other[s] = index


# For each sector: concatenate all descriptions, tokenize, drop stop words,
# then plot the 10 most frequent words
for s in type_of_companies:
    df = data_train[data_train['label'] == s]
    email = ''
    for i in df.index:
        email += df["text"][i]
    tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = tokenizer.tokenize(email)
    filtered_sentence = []
    for w in word_tokens:
        if w.lower() not in stop_words:
            filtered_sentence.append(w.lower())

    fdist2 = FreqDist(filtered_sentence)
    fdist2.plot(10, cumulative=False, title='Frequency for ' + str(s))

Each type of company is assigned a number and stored in a dictionary; this numeric encoding will be useful later for learning. Next, we want to see whether each sector of activity has words that appear frequently in its descriptions. We therefore count the most common words per sector and plot them as a frequency curve. We focus on the 10 most frequent words each time to have a sufficient sample, although this choice is arbitrary.
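
As a small follow-up, the sector-to-index dictionary can be applied to the label column so the data is ready for a classifier. This is a minimal sketch, assuming data_train is a pandas DataFrame with 'label' and 'text' columns and that label_other is the dictionary built above; the 'label_id' column name is just an illustrative choice.

# Minimal sketch: encode each sector label as its numeric index for learning.
# Assumes label_other maps sector -> index (built above);
# the column name 'label_id' is hypothetical.
data_train['label_id'] = data_train['label'].map(label_other)

# Quick sanity check: each sector should map to exactly one integer id
print(data_train[['label', 'label_id']].drop_duplicates().sort_values('label_id'))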