Punishment or prevention?: A machine learning analysis
of drug-related news coverage across cultures

#3-2 Title-Comment Count Feature Comparison

Comparison between Groups with Comments and without Comments

  • After analyzing article content, we attempted feature extraction from article headlines, which yielded clearer results
  • This suggests that public engagement varies with headline characteristics
  • The process follows similar methods to #3-1 (doc2bow-LDA topic extraction, TF-IDF, word2vec score comparison, word cloud analysis)
!pip install gensim
import pandas as pd

# Load the modified data
data = pd.read_csv("final_combined.csv")

# Remove commas from the comment_count column
# data['comment_count'] = data['comment_count'].str.replace(',', '')

# Convert the comment_count column to floats
data['comment_count'] = data['comment_count'].astype(float)

# Fill missing values with 0 (assign the result back rather than calling inplace on a column)
data['comment_count'] = data['comment_count'].fillna(0)

# Convert the comment_count column to integers
data['comment_count'] = data['comment_count'].astype(int)

data_without = data[data['comment_count'] == 0]
data_with = data[data['comment_count'] >= 1]

# Save the classified data to separate CSV files
data_without.to_csv("data_without_comments4.csv", index=False)
data_with.to_csv("data_with_comments4.csv", index=False)
print(len(data_without))
print(len(data_with))
306
921

The code above splits the articles by comment count into Group A (0 comments, 306 articles) and Group B (1 or more comments, 921 articles).
Later in this section, Word2Vec is used to look for keywords emphasized more in Group B than in Group A, by comparing keyword similarity scores against each group's mean word embedding.
While differences exist, they were deemed too minimal or insufficiently supported to be considered significant characteristics.

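# Imports used by the cleaning, topic modeling, and plotting code in this cell
import re
import matplotlib.pyplot as plt
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Read CSV files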
data_without = pd.read_csv("data_without_comments4.csv") 
data_with = pd.read_csv("data_with_comments4.csv")

# Clean text function: Remove non-Korean characters
def text_cleaning(text):
   hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
   result = hangul.sub('', text)
   return result

def split_text(text):
   return text.split()

# Clean text and tokenize
data_without['title_cleaned'] = data_without['title_tokenized'].apply(text_cleaning)
data_with['title_cleaned'] = data_with['title_tokenized'].apply(text_cleaning)

data_without['processed'] = data_without['title_cleaned'].apply(split_text) 
data_with['processed'] = data_with['title_cleaned'].apply(split_text)

# Create dictionaries and corpora
dictionary_data_without = Dictionary(data_without['processed'])
dictionary_data_with = Dictionary(data_with['processed'])

corpus_data_without = [dictionary_data_without.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with = [dictionary_data_with.doc2bow(doc) for doc in data_with['processed']]

# Set topic range (4-19; the upper bound of range() is exclusive)
topic_range = range(4, 20, 1)

# Compute gensim's log_perplexity (a per-word likelihood bound, hence the negative values)
def compute_perplexity(dictionary, corpus, num_topics):
   model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)
   return model.log_perplexity(corpus)

# Compute coherence score  
def compute_coherence_score(dictionary, corpus, tokens, num_topics):
   model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)
   coherence_model = CoherenceModel(model=model, texts=tokens, dictionary=dictionary, coherence='c_v')
   return coherence_model.get_coherence()

# Calculate scores with progress tracking
def calculate_scores(num_topics, dictionary, corpus, tokens):
   perplexity = compute_perplexity(dictionary, corpus, num_topics)
   coherence = compute_coherence_score(dictionary, corpus, tokens, num_topics)
   print(f"Completed: {num_topics} Topics - Perplexity: {perplexity}, Coherence: {coherence}")
   return perplexity, coherence
   
# Plot scores for each dataset
for dataset_name, dictionary, corpus, tokens, xlabel in [
   ('data_without', dictionary_data_without, corpus_data_without, data_without['processed'], "Group A(0 comments)"),
   ('data_with', dictionary_data_with, corpus_data_with, data_with['processed'], "Group B(1 or more comments)")
]:
   scores = [calculate_scores(num_topics, dictionary, corpus, tokens) for num_topics in topic_range]
   perplexity_scores, coherence_scores = zip(*scores)

   # Create plot
   fig, ax1 = plt.subplots()
   ax1.set_title(dataset_name)
   ax1.set_xlabel(f"Number of topics, {xlabel}")
   ax1.set_ylabel("Perplexity", color="tab:red")
   ax1.plot(topic_range, perplexity_scores, color="tab:red")
   ax1.tick_params(axis="y", labelcolor="tab:red")

   ax2 = ax1.twinx()
   ax2.set_ylabel("Coherence Score", color="tab:blue")
   ax2.plot(topic_range, coherence_scores, color="tab:blue")
   ax2.tick_params(axis="y", labelcolor="tab:blue")

   fig.tight_layout()
   plt.show()
Completed: 4 Topics - Perplexity: -6.694202300744531, Coherence: 0.5183502648919864
Completed: 5 Topics - Perplexity: -6.782394292295545, Coherence: 0.5190178272668947
Completed: 6 Topics - Perplexity: -6.850101864038894, Coherence: 0.5496991445281142
Completed: 7 Topics - Perplexity: -6.897715308137463, Coherence: 0.5452918949228462
Completed: 8 Topics - Perplexity: -6.9649837344264895, Coherence: 0.547730425416529
Completed: 9 Topics - Perplexity: -6.94488962418138, Coherence: 0.5234016628894879
Completed: 10 Topics - Perplexity: -7.006797607866297, Coherence: 0.5258574483438864
Completed: 11 Topics - Perplexity: -7.040775907845236, Coherence: 0.5202116773947044
Completed: 12 Topics - Perplexity: -7.08022206406787, Coherence: 0.5047261679319228
Completed: 13 Topics - Perplexity: -7.0521372745437185, Coherence: 0.4709636458147087
Completed: 14 Topics - Perplexity: -7.088286935140449, Coherence: 0.48421048417556206
Completed: 15 Topics - Perplexity: -7.127538838095048, Coherence: 0.4687486484987744
Completed: 16 Topics - Perplexity: -7.114620930418713, Coherence: 0.4624890371406705
Completed: 17 Topics - Perplexity: -7.1975189395647705, Coherence: 0.48723016276823555
Completed: 18 Topics - Perplexity: -7.227871895889235, Coherence: 0.5011506130597451
Completed: 19 Topics - Perplexity: -7.210372820143128, Coherence: 0.44895081954715704
Completed: 4 Topics - Perplexity: -7.039661237582169, Coherence: 0.4230930140250465
Completed: 5 Topics - Perplexity: -7.096526597236398, Coherence: 0.42428282221158853
Completed: 6 Topics - Perplexity: -7.151613650236074, Coherence: 0.4464536297759755
Completed: 7 Topics - Perplexity: -7.198608451554847, Coherence: 0.4293861277148522
Completed: 8 Topics - Perplexity: -7.2455033224805, Coherence: 0.4275871667108016
Completed: 9 Topics - Perplexity: -7.271459345733501, Coherence: 0.4319299916309159
Completed: 10 Topics - Perplexity: -7.3038622501464125, Coherence: 0.42089908896230027
Completed: 11 Topics - Perplexity: -7.342535086021269, Coherence: 0.4195796336958029
Completed: 12 Topics - Perplexity: -7.386926158415562, Coherence: 0.4358594159124464
Completed: 13 Topics - Perplexity: -7.4258744503621035, Coherence: 0.46420860530628094
Completed: 14 Topics - Perplexity: -7.445030907832302, Coherence: 0.4510695122103335
Completed: 15 Topics - Perplexity: -7.453250916369775, Coherence: 0.4469233799506487
Completed: 16 Topics - Perplexity: -7.4908246754304475, Coherence: 0.43761926402072737
Completed: 17 Topics - Perplexity: -7.49559072838817, Coherence: 0.44034263288168285
Completed: 18 Topics - Perplexity: -7.494526837791128, Coherence: 0.4412935760752916
Completed: 19 Topics - Perplexity: -7.515682874768846, Coherence: 0.41363212037108127
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Train the LDA models
lda_data_without = LdaModel(corpus_data_without, id2word=dictionary_data_without, num_topics=8, passes=20, random_state=42)
lda_data_with = LdaModel(corpus_data_with, id2word=dictionary_data_with, num_topics=13, passes=20, random_state=42)

vis_data_without = gensimvis.prepare(lda_data_without, corpus_data_without, dictionary_data_without, mds='mmds', n_jobs=1)
vis_data_with = gensimvis.prepare(lda_data_with, corpus_data_with, dictionary_data_with, mds='mmds', n_jobs=1)
for topic in lda_data_without.print_topics(num_topics=8):
    topic_num, topic_keywords = topic
    print(f"{topic_num} : {topic_keywords}")
0 : 0.088*"마약" + 0.024*"적발" + 0.013*"필요" + 0.010*"광고" + 0.010*"대책" + 0.009*"불법" + 0.007*"관세청" + 0.007*"한국" + 0.007*"거래" + 0.007*"밀수"
1 : 0.082*"마약" + 0.019*"경찰" + 0.017*"예방" + 0.015*"캠페인" + 0.009*"투약" + 0.009*"수사" + 0.009*"릴레이" + 0.008*"불법" + 0.007*"범죄" + 0.006*"마약왕"
2 : 0.079*"마약" + 0.013*"마약사범" + 0.011*"범죄" + 0.009*"본부" + 0.009*"김밥" + 0.009*"마약김밥" + 0.008*"수사" + 0.006*"정부" + 0.006*"퇴치" + 0.006*"유통"
3 : 0.088*"마약" + 0.019*"범죄" + 0.014*"마약왕" + 0.014*"정부" + 0.011*"도시" + 0.011*"근절" + 0.011*"청소년" + 0.007*"침투" + 0.007*"기적" + 0.007*"경찰"
4 : 0.097*"마약" + 0.021*"경찰" + 0.015*"범죄" + 0.012*"청소년" + 0.012*"혐의" + 0.009*"수사" + 0.009*"교육" + 0.006*"판매" + 0.006*"강남" + 0.006*"특별"
5 : 0.069*"마약" + 0.017*"투약" + 0.017*"범죄" + 0.013*"마약범죄" + 0.010*"예방" + 0.010*"청소년" + 0.010*"구속" + 0.007*"캠페인" + 0.007*"강남" + 0.007*"마약성"
6 : 0.124*"마약" + 0.017*"범죄" + 0.017*"수사" + 0.015*"청소년" + 0.014*"마약범죄" + 0.012*"마약사범" + 0.012*"예방" + 0.011*"중독" + 0.010*"음료" + 0.009*"재활"
7 : 0.110*"마약" + 0.026*"전쟁" + 0.016*"타워" + 0.016*"컨트롤" + 0.016*"컨트롤타워" + 0.011*"단속" + 0.011*"음료" + 0.011*"마약음료" + 0.011*"경찰" + 0.011*"예방"
for topic in lda_data_with.print_topics(num_topics=13):
    topic_num, topic_keywords = topic
    print(f"{topic_num} : {topic_keywords}")
0 : 0.087*"마약" + 0.016*"동남아" + 0.015*"마약왕" + 0.014*"범죄" + 0.011*"구속" + 0.010*"총책" + 0.009*"손자" + 0.008*"투약" + 0.007*"대통령" + 0.007*"전두환"
1 : 0.093*"마약" + 0.026*"전쟁" + 0.018*"범죄" + 0.016*"수사" + 0.011*"마약범죄" + 0.009*"한동훈" + 0.009*"검찰" + 0.008*"대검" + 0.007*"참사" + 0.007*"마약사범"
2 : 0.065*"마약" + 0.011*"마약왕" + 0.009*"국내" + 0.008*"징역형" + 0.008*"부모" + 0.006*"신고" + 0.005*"적발" + 0.005*"참사" + 0.005*"검찰" + 0.005*"태국인"
3 : 0.110*"마약" + 0.061*"음료" + 0.039*"마약음료" + 0.019*"강남" + 0.016*"학원가" + 0.011*"필로폰" + 0.011*"경찰" + 0.010*"투약" + 0.008*"수사" + 0.008*"조직"
4 : 0.098*"마약" + 0.035*"수사" + 0.027*"특별" + 0.021*"범죄" + 0.016*"마약범죄" + 0.015*"출범" + 0.015*"특별수사팀" + 0.013*"청소년" + 0.011*"전쟁" + 0.011*"정부"
5 : 0.092*"마약" + 0.032*"마약왕" + 0.029*"수리" + 0.025*"수리남" + 0.018*"멕시코" + 0.016*"밀반입" + 0.016*"반입" + 0.012*"검거" + 0.010*"미국" + 0.008*"사망"
6 : 0.086*"마약" + 0.016*"정국" + 0.015*"음주" + 0.014*"옛말" + 0.013*"뺑소니" + 0.012*"마약청정국" + 0.010*"무면허" + 0.010*"운전자" + 0.010*"사고" + 0.010*"하수"
7 : 0.099*"마약" + 0.026*"범죄" + 0.018*"전쟁" + 0.014*"일상" + 0.014*"손자" + 0.013*"투약" + 0.013*"전두환" + 0.012*"경찰" + 0.011*"밀수" + 0.011*"마약범죄"
8 : 0.101*"마약" + 0.056*"검거" + 0.036*"경찰" + 0.023*"투약" + 0.018*"마약사범" + 0.017*"무더기" + 0.016*"유아" + 0.015*"수사" + 0.015*"혐의" + 0.014*"필로폰"
9 : 0.101*"마약" + 0.022*"한국" + 0.017*"명분" + 0.013*"밀수" + 0.013*"클럽" + 0.012*"강남" + 0.012*"마약사범" + 0.010*"텔레그램" + 0.008*"수사" + 0.008*"조직"
10 : 0.129*"마약" + 0.024*"범죄" + 0.016*"중독" + 0.013*"수사" + 0.013*"한동훈" + 0.011*"보이스피싱" + 0.011*"치료" + 0.011*"보이스" + 0.010*"대검" + 0.009*"특수"
11 : 0.124*"마약" + 0.016*"전쟁" + 0.015*"수사" + 0.012*"대통령" + 0.012*"유통" + 0.011*"마약사범" + 0.010*"단속" + 0.010*"적발" + 0.010*"경찰" + 0.008*"강남"
12 : 0.105*"마약" + 0.041*"마약사범" + 0.015*"유혹" + 0.012*"급증" + 0.011*"처벌" + 0.011*"세대" + 0.010*"거래" + 0.010*"청정" + 0.010*"지위" + 0.010*"중학생"
  • LDA topic modeling on the doc2bow corpus shows that Group B (articles with comments) features topics such as Han Dong-hoon (government policy) and the drug case of Chun Doo-hwan's grandson
  • Combining TF-IDF embedding with topic modeling yielded insignificant results
  • We therefore decided that a direct TF-IDF comparison provides better explanatory power
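
The notebook does not state how the topic counts for the final models (8 for Group A, 13 for Group B) were chosen. One plausible reading is that they were picked from the coherence scan. The sketch below shows that selection rule on a subset of the coherence values printed earlier, rounded to four decimals. The helper names `best_topic_count` and `near_best` and the 0.005 tolerance are illustrative assumptions, not part of the original analysis.

```python
# Coherence (c_v) scores copied from the scan output, keyed by topic count.
# Only a subset of topic counts is listed to keep the sketch short.
coherence_group_a = {4: 0.5184, 6: 0.5497, 8: 0.5477, 13: 0.4710, 19: 0.4490}
coherence_group_b = {4: 0.4231, 6: 0.4465, 8: 0.4276, 13: 0.4642, 19: 0.4136}


def best_topic_count(coherence_by_k):
    """Return the topic count with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


def near_best(coherence_by_k, tolerance=0.005):
    """Return all topic counts whose coherence is within `tolerance` of the best."""
    best = max(coherence_by_k.values())
    return sorted(k for k, score in coherence_by_k.items() if best - score <= tolerance)


print(best_topic_count(coherence_group_a), near_best(coherence_group_a))
print(best_topic_count(coherence_group_b), near_best(coherence_group_b))
```

On these values, 13 is the clear maximum for Group B, while for Group A the chosen value of 8 is close to, but slightly below, the maximum at 6.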

Next step: Comparing groups using TF-IDF technique only

import numpy as np
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

# Calculate TF-IDF for each dataset
tfidf_data_without = TfidfModel(corpus_data_without)
tfidf_data_with = TfidfModel(corpus_data_with)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without = tfidf_data_without[corpus_data_without]
corpus_tfidf_data_with = tfidf_data_with[corpus_data_with]


# Find the top N keywords based on average TF-IDF scores
def get_top_N_keywords(tfidf_corpus, dictionary, N):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)
    top_N_indices = avg_tfidf.argsort()[-N:][::-1]
    top_N_keywords = [(dictionary[i], avg_tfidf[i]) for i in top_N_indices]

    return top_N_keywords

# Find the top N keywords for each dataset
N = 10
top_N_keywords_data_without = get_top_N_keywords(corpus_tfidf_data_without, dictionary_data_without, N)
top_N_keywords_data_with = get_top_N_keywords(corpus_tfidf_data_with, dictionary_data_with, N)

# Compare the top N keywords between the two datasets
print("Top 10 keywords in Group A (0 comments):", top_N_keywords_data_without)
print("Top 10 keywords in Group B (1 or more comments):", top_N_keywords_data_with)
Top 10 keywords in Group A (0 comments): [('범죄', 0.025118233782760943), ('마약사범', 0.021923877795007648), ('청소년', 0.021809244757240193), ('예방', 0.019322443618203237), ('수사', 0.01771660649457), ('전쟁', 0.01769250384200995), ('마약범죄', 0.017074217372907808), ('경찰', 0.015992481921368002), ('마약왕', 0.015234999038888414), ('음료', 0.014510423614610218)]
Top 10 keywords in Group B (1 or more comments): [('전쟁', 0.0212565977159329), ('마약사범', 0.019878078055057705), ('수사', 0.01806327796821882), ('음료', 0.017546353894940758), ('범죄', 0.016550337022138394), ('경찰', 0.015130577531840132), ('투약', 0.014714449071572759), ('검거', 0.01431982179820058), ('마약음료', 0.012922448066324253), ('강남', 0.012424066527380168)]

The output lists the top 10 TF-IDF keywords for each group. The lists overlap heavily (범죄, 마약사범, 수사, 전쟁, 경찰, and 음료 appear in both), reflecting similar headlines during specific reporting periods such as the Gangnam drug drink incident.
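
To make the averaging step concrete, here is a minimal pure-Python sketch of the same idea on three toy headlines. It assumes the weighting that gensim's `TfidfModel` applies by default, as far as I understand it: raw term frequency times log2(N / df), followed by L2 normalization per document. The function names and toy headlines are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """TF-IDF per document: tf * log2(N / df), then L2-normalized."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


def average_tfidf(vectors):
    """Sum each term's weight over all documents and divide by the document count."""
    totals = {}
    for vec in vectors:
        for term, weight in vec.items():
            totals[term] = totals.get(term, 0.0) + weight
    return {term: total / len(vectors) for term, total in totals.items()}


# Toy headlines (hypothetical); every one of them contains the token 마약.
headlines = [["마약", "수사"], ["마약", "예방"], ["마약", "예방", "캠페인"]]
avg = average_tfidf(tfidf_vectors(headlines))
ranked = sorted(avg, key=avg.get, reverse=True)
print(ranked)
```

A term that occurs in every headline gets an IDF of zero under this weighting. That may be why 마약 leads every LDA topic yet is missing from both top-10 TF-IDF lists, although the notebook itself does not confirm this.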

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import numpy as np

# Merge the two dictionaries
merged_dictionary = Dictionary(documents=data_without['processed'].tolist() + data_with['processed'].tolist())

# Convert the original corpora to the merged dictionary
corpus_data_without_merged = [merged_dictionary.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with_merged = [merged_dictionary.doc2bow(doc) for doc in data_with['processed']]

# Update the TfidfModels
tfidf_data_without = TfidfModel(corpus_data_without_merged)
tfidf_data_with = TfidfModel(corpus_data_with_merged)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without_merged = tfidf_data_without[corpus_data_without_merged]
corpus_tfidf_data_with_merged = tfidf_data_with[corpus_data_with_merged]

# Calculate the average TF-IDF scores for each term in the given corpus
def calculate_avg_tfidf(tfidf_corpus, dictionary):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)

    return avg_tfidf

# Calculate average TF-IDF scores for each dataset
avg_tfidf_data_without = calculate_avg_tfidf(corpus_tfidf_data_without_merged, merged_dictionary)
avg_tfidf_data_with = calculate_avg_tfidf(corpus_tfidf_data_with_merged, merged_dictionary)

# Calculate the difference in average TF-IDF scores between the two datasets
tfidf_diff = avg_tfidf_data_with - avg_tfidf_data_without

# Sort the terms based on the difference in their average scores (higher in data_with)
sorted_indices = np.argsort(tfidf_diff)[::-1]

# Print the terms with their average TF-IDF scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

for i in sorted_indices[:40]:  # Display the top 40 terms
    term = merged_dictionary[i]
    group_a_score = avg_tfidf_data_without[i]
    group_b_score = avg_tfidf_data_with[i]
    diff = tfidf_diff[i]
    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")
Term                                    Group A   Group B   Difference
--------------------------------------------------
한동훈                                     0.0011    0.0068    0.0057    
강력부                                     0.0000    0.0054    0.0054    
급증                                      0.0040    0.0092    0.0052    
조직                                      0.0045    0.0097    0.0052    
중국                                      0.0000    0.0048    0.0048    
파티                                      0.0000    0.0048    0.0048    
필로폰                                     0.0064    0.0110    0.0047    
강남                                      0.0078    0.0124    0.0046    
명분                                      0.0023    0.0067    0.0044    
유혹                                      0.0012    0.0056    0.0044    
특수                                      0.0027    0.0070    0.0043    
지시                                      0.0018    0.0059    0.0041    
사망                                      0.0015    0.0055    0.0039    
손자                                      0.0014    0.0051    0.0037    
대마                                      0.0028    0.0065    0.0036    
전쟁                                      0.0177    0.0213    0.0036    
특수본                                     0.0015    0.0051    0.0036    
대검                                      0.0034    0.0069    0.0035    
단속                                      0.0060    0.0095    0.0035    
학생                                      0.0017    0.0052    0.0035    
신종                                      0.0000    0.0034    0.0034    
검출                                      0.0000    0.0033    0.0033    
선포                                      0.0024    0.0057    0.0032    
중학생                                     0.0000    0.0032    0.0032    
전두환                                     0.0014    0.0045    0.0032    
폭행                                      0.0000    0.0031    0.0031    
전면전                                     0.0015    0.0046    0.0031    
배달                                      0.0000    0.0031    0.0031    
부산                                      0.0000    0.0031    0.0031    
유통                                      0.0068    0.0099    0.0031    
음료                                      0.0145    0.0175    0.0030    
대치동                                     0.0000    0.0030    0.0030    
대치                                      0.0000    0.0030    0.0030    
공급                                      0.0016    0.0046    0.0030    
커피                                      0.0000    0.0029    0.0029    
옛말                                      0.0000    0.0029    0.0029    
알바                                      0.0000    0.0029    0.0029    
무더기                                     0.0028    0.0057    0.0028    
어른                                      0.0000    0.0028    0.0028    
검찰                                      0.0055    0.0083    0.0028

The TF-IDF score comparison identified terms weighted more heavily in Group B (articles with comments), including several that never appear in Group A (score 0.0000).
The analysis revealed 5 key patterns in articles that drew comments:
1. Political/drug news (e.g., Han Dong-hoon, Chun Doo-hwan cases)
2. Specific drug names (methamphetamine, marijuana, new synthetic drugs)
3. Local incident details (Gangnam drink case, specific locations)
4. Distribution channels (delivery, supply, part-time workers)
5. Sensational language (surge, all-out war, mass arrests)
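
The distinction between terms that are exclusive to Group B and terms that merely score higher there can be made explicit. The sketch below separates the two cases for a handful of rows copied from the table. The function name `split_by_exclusivity` is a hypothetical helper, and it relies on the assumption that a displayed 0.0000 means the term never occurs in Group A.

```python
# (Group A, Group B) average TF-IDF scores copied from the table.
scores = {
    "한동훈": (0.0011, 0.0068),
    "강력부": (0.0000, 0.0054),
    "급증": (0.0040, 0.0092),
    "중국": (0.0000, 0.0048),
    "강남": (0.0078, 0.0124),
}


def split_by_exclusivity(term_scores):
    """Split B-leaning terms into those absent from Group A and those present in both."""
    exclusive, shared = [], []
    for term, (score_a, score_b) in term_scores.items():
        if score_b <= score_a:
            continue  # not a B-leaning term
        if score_a == 0:
            exclusive.append(term)
        else:
            shared.append(term)
    return exclusive, shared


exclusive_to_b, higher_in_b = split_by_exclusivity(scores)
print("Only in Group B:", exclusive_to_b)
print("In both, higher in Group B:", higher_in_b)
```

Terms in the second list, such as 강남, also occur in Group A, so they support a weaker claim than terms that occur only in Group B.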

Next step: To avoid logical leaps that could occur if these patterns appear in both groups, we will now examine terms that are distinctively characteristic of Group A (articles without comments).

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import numpy as np

# Merge the two dictionaries
merged_dictionary = Dictionary(documents=data_without['processed'].tolist() + data_with['processed'].tolist())

# Convert the original corpora to the merged dictionary
corpus_data_without_merged = [merged_dictionary.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with_merged = [merged_dictionary.doc2bow(doc) for doc in data_with['processed']]

# Update the TfidfModels
tfidf_data_without = TfidfModel(corpus_data_without_merged)
tfidf_data_with = TfidfModel(corpus_data_with_merged)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without_merged = tfidf_data_without[corpus_data_without_merged]
corpus_tfidf_data_with_merged = tfidf_data_with[corpus_data_with_merged]

# Calculate the average TF-IDF scores for each term in the given corpus
def calculate_avg_tfidf(tfidf_corpus, dictionary):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)

    return avg_tfidf

# Calculate average TF-IDF scores for each dataset
avg_tfidf_data_without = calculate_avg_tfidf(corpus_tfidf_data_without_merged, merged_dictionary)
avg_tfidf_data_with = calculate_avg_tfidf(corpus_tfidf_data_with_merged, merged_dictionary)

# Calculate the difference in average TF-IDF scores between the two datasets
tfidf_diff = avg_tfidf_data_without - avg_tfidf_data_with 

# Sort the terms based on the difference in their average scores (higher in data_without)
sorted_indices = np.argsort(tfidf_diff)[::-1]

# Print the terms with their average TF-IDF scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

for i in sorted_indices[:40]:  # Display the top 40 terms
    term = merged_dictionary[i]
    group_a_score = avg_tfidf_data_without[i]
    group_b_score = avg_tfidf_data_with[i]
    diff = tfidf_diff[i]
    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")
Term                                    Group A   Group B   Difference
--------------------------------------------------
예방                                      0.0193    0.0029    0.0164    
청소년                                     0.0218    0.0098    0.0120    
캠페인                                     0.0118    0.0010    0.0108    
범죄                                      0.0251    0.0166    0.0086    
교육                                      0.0104    0.0021    0.0083    
마약범죄                                    0.0171    0.0092    0.0078    
진통제                                     0.0090    0.0023    0.0067    
근절                                      0.0093    0.0027    0.0066    
재활                                      0.0116    0.0052    0.0064    
릴레이                                     0.0063    0.0000    0.0063    
도시                                      0.0060    0.0000    0.0060    
외국인                                     0.0086    0.0025    0.0060    
필요                                      0.0081    0.0021    0.0060    
반입                                      0.0094    0.0039    0.0055    
컨트롤타워                                   0.0080    0.0025    0.0055    
컨트롤                                     0.0080    0.0025    0.0055    
예방교육                                    0.0065    0.0011    0.0054    
고교생                                     0.0085    0.0031    0.0054    
타워                                      0.0080    0.0028    0.0052    
지원                                      0.0068    0.0017    0.0051    
관세청                                     0.0072    0.0022    0.0050    
서울시                                     0.0054    0.0005    0.0049    
환자                                      0.0053    0.0004    0.0049    
마약중독                                    0.0080    0.0032    0.0048    
발의                                      0.0051    0.0005    0.0046    
경찰서                                     0.0046    0.0000    0.0046    
증가                                      0.0084    0.0038    0.0046    
마약성                                     0.0064    0.0020    0.0043    
학교                                      0.0058    0.0014    0.0043    
마약왕                                     0.0152    0.0109    0.0043    
실시                                      0.0048    0.0004    0.0043    
히틀러                                     0.0055    0.0012    0.0043    
불법                                      0.0082    0.0041    0.0042    
임계점                                     0.0050    0.0008    0.0042    
강남구                                     0.0042    0.0000    0.0042    
교육청                                     0.0049    0.0008    0.0041    
중독자                                     0.0068    0.0027    0.0041    
조례                                      0.0041    0.0000    0.0041    
수도                                      0.0046    0.0005    0.0041    
치료시설                                    0.0041    0.0000    0.0041

We searched Group A's distinctive terms for counterexamples to Group B's five characteristics. The term '강남구 (Gangnam-gu)' could be read as a reference to the Gangnam drink incident and thus as a counterexample to the third characteristic (local incident details), but it does not appear in Group B at all (score 0.0000), which makes it unsuitable for comparison. No other counterexamples were found.
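
The counterexample search can be sketched as a substring match between pattern keywords and Group A's distinctive terms. The grouping of keywords under each of the five patterns is my own assumption based on the examples named in the summary, and only a few Group A terms from the table are included. `find_counterexamples` is a hypothetical helper.

```python
# Keywords per Group B pattern (assumed mapping; terms taken from the Group B table).
pattern_keywords = {
    "political": ["한동훈", "전두환", "손자"],
    "drug names": ["필로폰", "대마", "신종"],
    "local incidents": ["강남", "대치동", "음료"],
    "distribution": ["배달", "공급", "알바"],
    "sensational": ["급증", "전면전", "무더기"],
}

# A subset of the terms that are distinctive for Group A (from the Group A table).
group_a_terms = ["예방", "청소년", "캠페인", "교육", "재활", "릴레이", "강남구", "치료시설"]


def find_counterexamples(patterns, a_terms):
    """Return Group A terms that contain any keyword of a Group B pattern."""
    hits = {}
    for name, keywords in patterns.items():
        matched = [term for term in a_terms if any(kw in term for kw in keywords)]
        if matched:
            hits[name] = matched
    return hits


print(find_counterexamples(pattern_keywords, group_a_terms))
```

On this subset the only hit is 강남구 under the local-incident pattern, which matches the single candidate discussed above.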

Next step: After completing our TF-IDF analysis, we will explore keyword characteristics using Word2vec for additional insights.
Word2vec: Neural network technique that converts words to vectors based on context, capturing semantic relationships between words that appear in similar contexts

import pandas as pd
import numpy as np
from gensim.models import Word2Vec

# Combine the preprocessed text from both groups
combined_text = data_without['processed'].tolist() + data_with['processed'].tolist()

# Train a Word2Vec model using both datasets
model = Word2Vec(sentences=combined_text, vector_size=100, window=5, min_count=1, workers=4)

# Calculate the average word embeddings for each article in both groups
def average_word_embeddings(text, word2vec_model):
    embeddings = []
    for word in text:
        if word in word2vec_model.wv:
            embeddings.append(word2vec_model.wv[word])
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros((word2vec_model.vector_size,))

data_without['avg_word_embeddings'] = data_without['processed'].apply(average_word_embeddings, word2vec_model=model)
data_with['avg_word_embeddings'] = data_with['processed'].apply(average_word_embeddings, word2vec_model=model)

# Calculate the average word embeddings for both groups
group_a_avg = np.mean(np.vstack(data_without['avg_word_embeddings']), axis=0)
group_b_avg = np.mean(np.vstack(data_with['avg_word_embeddings']), axis=0)

# Get all unique keywords from both groups
keywords = set()
for text in combined_text:
    keywords.update(text)

keywords = list(keywords)
keyword_vectors = np.vstack([model.wv[keyword] for keyword in keywords])

# Score each keyword by the dot product of its vector with each group's average embedding (an unnormalized similarity)
keyword_scores_a = np.dot(keyword_vectors, group_a_avg)
keyword_scores_b = np.dot(keyword_vectors, group_b_avg)

# Calculate the difference in similarity scores between the two groups for each keyword
keyword_diffs = keyword_scores_b - keyword_scores_a

# Sort the keywords based on the difference in their embeddings (higher in data_with)
sorted_indices = np.argsort(keyword_diffs)[::-1]

# Print the keywords with their similarity scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

count = 0
for i in sorted_indices:
    if count >= 40:  # Display the top 40 terms
        break

    term = keywords[i]
    group_a_score = keyword_scores_a[i]
    group_b_score = keyword_scores_b[i]
    diff = keyword_diffs[i]

    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")
    count += 1
Term                                    Group A   Group B   Difference
--------------------------------------------------
압수물                                     -0.0041   -0.0040   0.0001    
절대                                      -0.0026   -0.0025   0.0000    
이주                                      -0.0020   -0.0020   0.0000    
도주                                      -0.0023   -0.0023   0.0000    
시위                                      -0.0007   -0.0007   0.0000    
신전                                      -0.0024   -0.0023   0.0000    
가방                                      -0.0020   -0.0020   0.0000    
즉각                                      -0.0012   -0.0011   0.0000    
평창경찰                                    -0.0028   -0.0028   0.0000    
다크웹으                                    -0.0016   -0.0015   0.0000    
버닝                                      -0.0011   -0.0010   0.0000    
만명                                      -0.0017   -0.0017   0.0000    
동해                                      -0.0028   -0.0027   0.0000    
패소                                      -0.0023   -0.0023   0.0000    
연결                                      -0.0013   -0.0012   0.0000    
전방위                                     -0.0015   -0.0014   0.0000    
다이어트약                                   -0.0012   -0.0012   0.0000    
휴일                                      -0.0014   -0.0014   0.0000    
페스티벌                                    -0.0016   -0.0016   0.0000    
방해                                      -0.0019   -0.0019   0.0000    
반성                                      -0.0010   -0.0009   0.0000    
경필                                      -0.0016   -0.0015   0.0000    
방책                                      -0.0019   -0.0019   0.0000    
포퓰리즘                                    -0.0022   -0.0022   0.0000    
관물대                                     -0.0024   -0.0024   0.0000    
채팅                                      -0.0012   -0.0012   0.0000    
중국발                                     -0.0003   -0.0003   0.0000    
군수                                      -0.0002   -0.0002   0.0000    
위력순찰                                    -0.0009   -0.0009   0.0000    
우주                                      -0.0013   -0.0013   0.0000    
불안하다                                    -0.0014   -0.0014   0.0000    
전쟁살인                                    -0.0014   -0.0014   0.0000    
나르키소스                                   -0.0019   -0.0019   0.0000    
포르                                      -0.0010   -0.0010   0.0000    
계파                                      -0.0009   -0.0009   0.0000    
마조부                                     -0.0002   -0.0002   0.0000    
주택                                      -0.0009   -0.0009   0.0000    
충청권                                     -0.0012   -0.0012   0.0000    
원석                                      -0.0017   -0.0017   0.0000    
술잔                                      -0.0004   -0.0003   0.0000

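Word2vec captures semantic similarity from the contexts in which words appear, and the two groups yield nearly identical scores for every keyword (the largest difference is 0.0001), so no significant differences were observed.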
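
One way to see why the differences are so small: because the scores are dot products, the per-keyword difference equals the keyword vector dotted with the gap between the two group means. If the group means nearly coincide, every difference is close to zero. The sketch below checks this identity on made-up 3-dimensional vectors; the real model uses 100 dimensions, and all numbers here are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical group mean embeddings that are almost identical.
group_a_avg = [0.10, -0.20, 0.05]
group_b_avg = [0.11, -0.19, 0.05]
keyword_vec = [0.02, 0.01, -0.03]

# Difference computed as in the notebook: score in B minus score in A.
diff_direct = dot(keyword_vec, group_b_avg) - dot(keyword_vec, group_a_avg)
# The same quantity written as a single dot product with the gap between means.
diff_via_gap = dot(keyword_vec, subtract(group_b_avg, group_a_avg))

print(diff_direct, diff_via_gap, cosine(group_a_avg, group_b_avg))
```

Comparing the two group means directly, for example by cosine similarity, would therefore be a quick check before ranking individual keywords.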

Next step: Frequency analysis and word cloud visualization

import re
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Set a Korean font so Hangul labels render in the plots
font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
rc('font', family=font_name)

# # Create a function to clean tokens
# def clean_token(token):
#     return token.replace("[", "").replace("]", "").replace("'", "").strip()

def word_graph(ax, cnt, group_name, max_words=10):
    sorted_w = sorted(cnt.items(), key=lambda kv: kv[1])
    print(f"{group_name}:\n", sorted_w[-max_words:])
    w, n = zip(*sorted_w[-max_words:])
    ax.barh(range(len(w)), n, tick_label=w)
    ax.set_title(group_name)
    
# Text cleaning function: remove every character that is not Korean.
def clean_token(text):
    # Keep only Hangul (and spaces) using a regular expression.
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', text)
    return result

def split_text(text):
    tokens = text.split()
    return tokens

# Apply the text cleaning and tokenization to both groups.
data_without['title_cleaned'] = data_without['title_tokenized'].apply(text_cleaning)
data_with['title_cleaned'] = data_with['title_tokenized'].apply(text_cleaning)

data_without['processed'] = data_without['title_cleaned'].apply(split_text)
data_with['processed'] = data_with['title_cleaned'].apply(split_text)

fig, axes = plt.subplots(2, 1, figsize=(10, 12))

# Group A
data_A = pd.read_csv('data_without_comments4.csv')
data_A['title_cleaned'] = data_A['title_tokenized'].apply(text_cleaning)
data_A['processed'] = data_A['title_cleaned'].apply(split_text)

content_tokens_A = data_A['processed']

tokens_cnt_A = {}
for tokens in content_tokens_A:
    for token in tokens:
        cleaned_token = clean_token(token)
        if cleaned_token:
            tokens_cnt_A[cleaned_token] = tokens_cnt_A.get(cleaned_token, 0) + 1

word_graph(axes[0], tokens_cnt_A, "Group A", max_words=20)

# Group B
data_B = pd.read_csv('data_with_comments4.csv')
data_B['title_cleaned'] = data_B['title_tokenized'].apply(text_cleaning)
data_B['processed'] = data_B['title_cleaned'].apply(split_text)

content_tokens_B = data_B['processed']

tokens_cnt_B = {}
for tokens in content_tokens_B:
    for token in tokens:
        cleaned_token = clean_token(token)
        if cleaned_token:
            tokens_cnt_B[cleaned_token] = tokens_cnt_B.get(cleaned_token, 0) + 1

word_graph(axes[1], tokens_cnt_B, "Group B", max_words=20)

plt.tight_layout()
plt.show()
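
The token-counting loops above could also be written with `collections.Counter`. The following is a minimal sketch, not the notebook's own method: `count_tokens`, `relative_freq`, and the toy token lists are hypothetical names and data standing in for `data_A['processed']` and `data_B['processed']`. Because the two groups differ in size (306 vs. 921 articles), the sketch also converts raw counts to relative frequencies so the groups can be compared on the same scale.

```python
from collections import Counter


def count_tokens(token_lists):
    """Count how often each non-empty token occurs across all titles."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if t)
    return counts


def relative_freq(counts):
    """Convert raw counts to shares of all tokens in the group."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


# Hypothetical toy data (assumption): stand-ins for the 'processed' columns
titles_a = [["마약", "단속"], ["마약", "밀수"]]
titles_b = [["마약", "처벌"], ["처벌", "강화"], ["마약", "단속"]]

counts_a = count_tokens(titles_a)
counts_b = count_tokens(titles_b)

# most_common(n) returns the n most frequent (token, count) pairs
print("Group A:", counts_a.most_common(2))
print("Group B:", counts_b.most_common(2))

# Difference in relative frequency (B minus A) for every token seen in either group
freq_a = relative_freq(counts_a)
freq_b = relative_freq(counts_b)
diff = {tok: freq_b.get(tok, 0.0) - freq_a.get(tok, 0.0)
        for tok in set(freq_a) | set(freq_b)}
```

A `Counter` can be passed to `word_graph` unchanged, since it supports `.items()` like a regular dictionary.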