#3-2 Title-Comment Count Feature Comparison

Comparison between Groups with Comments and without Comments

After analyzing article content, we attempted feature extraction from article headlines, which yielded clearer results.
This suggests that public engagement varies with headline characteristics.
The process follows the same methods as #3-1 (doc2bow-LDA topic extraction, TF-IDF, word2vec score comparison, word cloud analysis).

!pip install gensim
import pandas as pd

# Load the modified data
data = pd.read_csv("final_combined.csv")

# Remove commas from the comment_count column
# data['comment_count'] = data['comment_count'].str.replace(',', '')

# Convert the comment_count column to floats
data['comment_count'] = data['comment_count'].astype(float)

# Fill missing values with 0 (assign back instead of calling inplace on a column selection)
data['comment_count'] = data['comment_count'].fillna(0)

# Convert the comment_count column to integers
data['comment_count'] = data['comment_count'].astype(int)

data_without = data[data['comment_count'] == 0]
data_with = data[data['comment_count'] >= 1]

# Save the classified data to separate CSV files
data_without.to_csv("data_without_comments4.csv", index=False)
data_with.to_csv("data_with_comments4.csv", index=False)

print(len(data_without))
print(len(data_with))

306
921
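
The grouping step in the cell above can be sketched without pandas. This is a minimal illustration, assuming raw counts may arrive as strings with thousands separators or as missing values; `parse_comment_count` and `split_by_comments` are hypothetical helper names, not part of the original notebook, and the rows are made up.

```python
def parse_comment_count(raw):
    """Turn a raw comment count ('1,234', 7, None, '') into an int; missing -> 0."""
    if raw is None:
        return 0
    text = str(raw).replace(",", "").strip()
    if text == "" or text.lower() == "nan":
        return 0
    return int(float(text))


def split_by_comments(rows):
    """Split rows into Group A (0 comments) and Group B (1 or more comments)."""
    group_a, group_b = [], []
    for row in rows:
        count = parse_comment_count(row.get("comment_count"))
        (group_b if count >= 1 else group_a).append(row)
    return group_a, group_b


# Illustrative rows only; these are not taken from final_combined.csv.
rows = [
    {"title": "t1", "comment_count": "1,234"},
    {"title": "t2", "comment_count": None},
    {"title": "t3", "comment_count": 3.0},
    {"title": "t4", "comment_count": "0"},
]
group_a, group_b = split_by_comments(rows)
print(len(group_a), len(group_b))
```

Cleaning the count before comparing it to zero keeps a value such as "1,234" from raising an error or being misfiled into Group A.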

The code above splits the articles into Group A (0 comments; 306 articles) and Group B (1 or more comments; 921 articles).
For reference, the Word2Vec step later in this section looks for keywords emphasized more in Group B than in Group A: it compares each keyword's similarity score against the mean word embedding of each group.
Differences exist, but they were too small and too weakly supported to count as significant characteristics.

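import re
import matplotlib.pyplot as plt
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Read CSV files
data_without = pd.read_csv("data_without_comments4.csv")
data_with = pd.read_csv("data_with_comments4.csv")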

# Clean text function: Remove non-Korean characters
def text_cleaning(text):
   hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
   result = hangul.sub('', text)
   return result

def split_text(text):
   return text.split()

# Clean text and tokenize
data_without['title_cleaned'] = data_without['title_tokenized'].apply(text_cleaning)
data_with['title_cleaned'] = data_with['title_tokenized'].apply(text_cleaning)

data_without['processed'] = data_without['title_cleaned'].apply(split_text) 
data_with['processed'] = data_with['title_cleaned'].apply(split_text)

# Create dictionaries and corpora
dictionary_data_without = Dictionary(data_without['processed'])
dictionary_data_with = Dictionary(data_with['processed'])

corpus_data_without = [dictionary_data_without.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with = [dictionary_data_with.doc2bow(doc) for doc in data_with['processed']]

# Set topic range (4-19; the upper bound of range() is exclusive)
topic_range = range(4, 20, 1)

# Compute perplexity
def compute_perplexity(dictionary, corpus, num_topics):
   model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)
   return model.log_perplexity(corpus)

# Compute coherence score  
def compute_coherence_score(dictionary, corpus, tokens, num_topics):
   model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=5, random_state=42)
   coherence_model = CoherenceModel(model=model, texts=tokens, dictionary=dictionary, coherence='c_v')
   return coherence_model.get_coherence()

# Calculate scores with progress tracking
def calculate_scores(num_topics, dictionary, corpus, tokens):
   perplexity = compute_perplexity(dictionary, corpus, num_topics)
   coherence = compute_coherence_score(dictionary, corpus, tokens, num_topics)
   print(f"Completed: {num_topics} Topics - Perplexity: {perplexity}, Coherence: {coherence}")
   return perplexity, coherence
   
# Plot scores for each dataset
for dataset_name, dictionary, corpus, tokens, xlabel in [
   ('data_without', dictionary_data_without, corpus_data_without, data_without['processed'], "Group A(0 comments)"),
   ('data_with', dictionary_data_with, corpus_data_with, data_with['processed'], "Group B(1 or more comments)")
]:
   scores = [calculate_scores(num_topics, dictionary, corpus, tokens) for num_topics in topic_range]
   perplexity_scores, coherence_scores = zip(*scores)

   # Create plot
   fig, ax1 = plt.subplots()
   ax1.set_title(dataset_name)
   ax1.set_xlabel(xlabel)
   ax1.set_ylabel("Perplexity", color="tab:red")
   ax1.plot(topic_range, perplexity_scores, color="tab:red")
   ax1.tick_params(axis="y", labelcolor="tab:red")

   ax2 = ax1.twinx()
   ax2.set_ylabel("Coherence Score", color="tab:blue")
   ax2.plot(topic_range, coherence_scores, color="tab:blue")
   ax2.tick_params(axis="y", labelcolor="tab:blue")

   fig.tight_layout()
   plt.show()

Completed: 4 Topics - Perplexity: -6.694202300744531, Coherence: 0.5183502648919864
Completed: 5 Topics - Perplexity: -6.782394292295545, Coherence: 0.5190178272668947
Completed: 6 Topics - Perplexity: -6.850101864038894, Coherence: 0.5496991445281142
Completed: 7 Topics - Perplexity: -6.897715308137463, Coherence: 0.5452918949228462
Completed: 8 Topics - Perplexity: -6.9649837344264895, Coherence: 0.547730425416529
Completed: 9 Topics - Perplexity: -6.94488962418138, Coherence: 0.5234016628894879
Completed: 10 Topics - Perplexity: -7.006797607866297, Coherence: 0.5258574483438864
Completed: 11 Topics - Perplexity: -7.040775907845236, Coherence: 0.5202116773947044
Completed: 12 Topics - Perplexity: -7.08022206406787, Coherence: 0.5047261679319228
Completed: 13 Topics - Perplexity: -7.0521372745437185, Coherence: 0.4709636458147087
Completed: 14 Topics - Perplexity: -7.088286935140449, Coherence: 0.48421048417556206
Completed: 15 Topics - Perplexity: -7.127538838095048, Coherence: 0.4687486484987744
Completed: 16 Topics - Perplexity: -7.114620930418713, Coherence: 0.4624890371406705
Completed: 17 Topics - Perplexity: -7.1975189395647705, Coherence: 0.48723016276823555
Completed: 18 Topics - Perplexity: -7.227871895889235, Coherence: 0.5011506130597451
Completed: 19 Topics - Perplexity: -7.210372820143128, Coherence: 0.44895081954715704

Completed: 4 Topics - Perplexity: -7.039661237582169, Coherence: 0.4230930140250465
Completed: 5 Topics - Perplexity: -7.096526597236398, Coherence: 0.42428282221158853
Completed: 6 Topics - Perplexity: -7.151613650236074, Coherence: 0.4464536297759755
Completed: 7 Topics - Perplexity: -7.198608451554847, Coherence: 0.4293861277148522
Completed: 8 Topics - Perplexity: -7.2455033224805, Coherence: 0.4275871667108016
Completed: 9 Topics - Perplexity: -7.271459345733501, Coherence: 0.4319299916309159
Completed: 10 Topics - Perplexity: -7.3038622501464125, Coherence: 0.42089908896230027
Completed: 11 Topics - Perplexity: -7.342535086021269, Coherence: 0.4195796336958029
Completed: 12 Topics - Perplexity: -7.386926158415562, Coherence: 0.4358594159124464
Completed: 13 Topics - Perplexity: -7.4258744503621035, Coherence: 0.46420860530628094
Completed: 14 Topics - Perplexity: -7.445030907832302, Coherence: 0.4510695122103335
Completed: 15 Topics - Perplexity: -7.453250916369775, Coherence: 0.4469233799506487
Completed: 16 Topics - Perplexity: -7.4908246754304475, Coherence: 0.43761926402072737
Completed: 17 Topics - Perplexity: -7.49559072838817, Coherence: 0.44034263288168285
Completed: 18 Topics - Perplexity: -7.494526837791128, Coherence: 0.4412935760752916
Completed: 19 Topics - Perplexity: -7.515682874768846, Coherence: 0.41363212037108127
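
One way to read the two logs above is to pick the topic count with the highest c_v coherence. The sketch below does that with the printed coherence values, rounded to four decimals; `best_num_topics` is a hypothetical helper, and taking the maximum is only one possible selection rule, not necessarily the one the notebook applied.

```python
def best_num_topics(scores):
    """Return the topic count with the highest coherence (higher c_v is better)."""
    return max(scores, key=scores.get)


# Coherence values copied from the printed logs above, rounded to 4 decimals.
coherence_group_a = {
    4: 0.5184, 5: 0.5190, 6: 0.5497, 7: 0.5453, 8: 0.5477, 9: 0.5234,
    10: 0.5259, 11: 0.5202, 12: 0.5047, 13: 0.4710, 14: 0.4842, 15: 0.4687,
    16: 0.4625, 17: 0.4872, 18: 0.5012, 19: 0.4490,
}
coherence_group_b = {
    4: 0.4231, 5: 0.4243, 6: 0.4465, 7: 0.4294, 8: 0.4276, 9: 0.4319,
    10: 0.4209, 11: 0.4196, 12: 0.4359, 13: 0.4642, 14: 0.4511, 15: 0.4469,
    16: 0.4376, 17: 0.4403, 18: 0.4413, 19: 0.4136,
}

print("Group A:", best_num_topics(coherence_group_a))
print("Group B:", best_num_topics(coherence_group_b))
```

By this rule Group A peaks at 6 topics and Group B at 13. The models trained next use 8 and 13; for Group A, 8 topics scores within 0.002 of the maximum.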

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Train the LDA models
lda_data_without = LdaModel(corpus_data_without, id2word=dictionary_data_without, num_topics=8, passes=20, random_state=42)
lda_data_with = LdaModel(corpus_data_with, id2word=dictionary_data_with, num_topics=13, passes=20, random_state=42)

vis_data_without = gensimvis.prepare(lda_data_without, corpus_data_without, dictionary_data_without, mds='mmds', n_jobs=1)
vis_data_with = gensimvis.prepare(lda_data_with, corpus_data_with, dictionary_data_with, mds='mmds', n_jobs=1)

pyLDAvis.display(vis_data_without)

for topic in lda_data_without.print_topics(num_topics=8):
    topic_num, topic_keywords = topic
    print(f"{topic_num} : {topic_keywords}")

0 : 0.088*"마약" + 0.024*"적발" + 0.013*"필요" + 0.010*"광고" + 0.010*"대책" + 0.009*"불법" + 0.007*"관세청" + 0.007*"한국" + 0.007*"거래" + 0.007*"밀수"
1 : 0.082*"마약" + 0.019*"경찰" + 0.017*"예방" + 0.015*"캠페인" + 0.009*"투약" + 0.009*"수사" + 0.009*"릴레이" + 0.008*"불법" + 0.007*"범죄" + 0.006*"마약왕"
2 : 0.079*"마약" + 0.013*"마약사범" + 0.011*"범죄" + 0.009*"본부" + 0.009*"김밥" + 0.009*"마약김밥" + 0.008*"수사" + 0.006*"정부" + 0.006*"퇴치" + 0.006*"유통"
3 : 0.088*"마약" + 0.019*"범죄" + 0.014*"마약왕" + 0.014*"정부" + 0.011*"도시" + 0.011*"근절" + 0.011*"청소년" + 0.007*"침투" + 0.007*"기적" + 0.007*"경찰"
4 : 0.097*"마약" + 0.021*"경찰" + 0.015*"범죄" + 0.012*"청소년" + 0.012*"혐의" + 0.009*"수사" + 0.009*"교육" + 0.006*"판매" + 0.006*"강남" + 0.006*"특별"
5 : 0.069*"마약" + 0.017*"투약" + 0.017*"범죄" + 0.013*"마약범죄" + 0.010*"예방" + 0.010*"청소년" + 0.010*"구속" + 0.007*"캠페인" + 0.007*"강남" + 0.007*"마약성"
6 : 0.124*"마약" + 0.017*"범죄" + 0.017*"수사" + 0.015*"청소년" + 0.014*"마약범죄" + 0.012*"마약사범" + 0.012*"예방" + 0.011*"중독" + 0.010*"음료" + 0.009*"재활"
7 : 0.110*"마약" + 0.026*"전쟁" + 0.016*"타워" + 0.016*"컨트롤" + 0.016*"컨트롤타워" + 0.011*"단속" + 0.011*"음료" + 0.011*"마약음료" + 0.011*"경찰" + 0.011*"예방"

for topic in lda_data_with.print_topics(num_topics=13):
    topic_num, topic_keywords = topic
    print(f"{topic_num} : {topic_keywords}")

0 : 0.087*"마약" + 0.016*"동남아" + 0.015*"마약왕" + 0.014*"범죄" + 0.011*"구속" + 0.010*"총책" + 0.009*"손자" + 0.008*"투약" + 0.007*"대통령" + 0.007*"전두환"
1 : 0.093*"마약" + 0.026*"전쟁" + 0.018*"범죄" + 0.016*"수사" + 0.011*"마약범죄" + 0.009*"한동훈" + 0.009*"검찰" + 0.008*"대검" + 0.007*"참사" + 0.007*"마약사범"
2 : 0.065*"마약" + 0.011*"마약왕" + 0.009*"국내" + 0.008*"징역형" + 0.008*"부모" + 0.006*"신고" + 0.005*"적발" + 0.005*"참사" + 0.005*"검찰" + 0.005*"태국인"
3 : 0.110*"마약" + 0.061*"음료" + 0.039*"마약음료" + 0.019*"강남" + 0.016*"학원가" + 0.011*"필로폰" + 0.011*"경찰" + 0.010*"투약" + 0.008*"수사" + 0.008*"조직"
4 : 0.098*"마약" + 0.035*"수사" + 0.027*"특별" + 0.021*"범죄" + 0.016*"마약범죄" + 0.015*"출범" + 0.015*"특별수사팀" + 0.013*"청소년" + 0.011*"전쟁" + 0.011*"정부"
5 : 0.092*"마약" + 0.032*"마약왕" + 0.029*"수리" + 0.025*"수리남" + 0.018*"멕시코" + 0.016*"밀반입" + 0.016*"반입" + 0.012*"검거" + 0.010*"미국" + 0.008*"사망"
6 : 0.086*"마약" + 0.016*"정국" + 0.015*"음주" + 0.014*"옛말" + 0.013*"뺑소니" + 0.012*"마약청정국" + 0.010*"무면허" + 0.010*"운전자" + 0.010*"사고" + 0.010*"하수"
7 : 0.099*"마약" + 0.026*"범죄" + 0.018*"전쟁" + 0.014*"일상" + 0.014*"손자" + 0.013*"투약" + 0.013*"전두환" + 0.012*"경찰" + 0.011*"밀수" + 0.011*"마약범죄"
8 : 0.101*"마약" + 0.056*"검거" + 0.036*"경찰" + 0.023*"투약" + 0.018*"마약사범" + 0.017*"무더기" + 0.016*"유아" + 0.015*"수사" + 0.015*"혐의" + 0.014*"필로폰"
9 : 0.101*"마약" + 0.022*"한국" + 0.017*"명분" + 0.013*"밀수" + 0.013*"클럽" + 0.012*"강남" + 0.012*"마약사범" + 0.010*"텔레그램" + 0.008*"수사" + 0.008*"조직"
10 : 0.129*"마약" + 0.024*"범죄" + 0.016*"중독" + 0.013*"수사" + 0.013*"한동훈" + 0.011*"보이스피싱" + 0.011*"치료" + 0.011*"보이스" + 0.010*"대검" + 0.009*"특수"
11 : 0.124*"마약" + 0.016*"전쟁" + 0.015*"수사" + 0.012*"대통령" + 0.012*"유통" + 0.011*"마약사범" + 0.010*"단속" + 0.010*"적발" + 0.010*"경찰" + 0.008*"강남"
12 : 0.105*"마약" + 0.041*"마약사범" + 0.015*"유혹" + 0.012*"급증" + 0.011*"처벌" + 0.011*"세대" + 0.010*"거래" + 0.010*"청정" + 0.010*"지위" + 0.010*"중학생"

The doc2bow-LDA analysis shows that Group B (articles with 1 or more comments) features topics such as Han Dong-hoon (government policy) and the drug case involving Chun Doo-hwan's grandson.
Combining TF-IDF embedding with topic modeling yielded insignificant results, so we decided that a direct TF-IDF comparison offers better explanatory power.
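
For readers unfamiliar with the bag-of-words input that the LDA models consume, the sketch below shows the idea behind `doc2bow`: each tokenized title becomes a list of (token id, count) pairs. It is a stdlib illustration only; `build_vocab` and `to_bow` are hypothetical names, the titles are toy data, and gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized title."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy tokenized titles (illustrative, not rows from the dataset).
docs = [["마약", "수사", "마약"], ["마약", "전쟁"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is why the topic lists above consist of weighted individual terms.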

Next step: Comparing groups using TF-IDF technique only

import numpy as np
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel, TfidfModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt

# Calculate TF-IDF for each dataset
tfidf_data_without = TfidfModel(corpus_data_without)
tfidf_data_with = TfidfModel(corpus_data_with)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without = tfidf_data_without[corpus_data_without]
corpus_tfidf_data_with = tfidf_data_with[corpus_data_with]


# Find the top N keywords based on average TF-IDF scores
def get_top_N_keywords(tfidf_corpus, dictionary, N):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)
    top_N_indices = avg_tfidf.argsort()[-N:][::-1]
    top_N_keywords = [(dictionary[i], avg_tfidf[i]) for i in top_N_indices]

    return top_N_keywords

# Find the top N keywords for each dataset
N = 10
top_N_keywords_data_without = get_top_N_keywords(corpus_tfidf_data_without, dictionary_data_without, N)
top_N_keywords_data_with = get_top_N_keywords(corpus_tfidf_data_with, dictionary_data_with, N)

# Compare the top N keywords between the two datasets
print("Top 10 keywords in Group A (0 comments):", top_N_keywords_data_without)
print("Top 10 keywords in Group B (1 or more comments):", top_N_keywords_data_with)

Top 10 keywords in Group A (0 comments): [('범죄', 0.025118233782760943), ('마약사범', 0.021923877795007648), ('청소년', 0.021809244757240193), ('예방', 0.019322443618203237), ('수사', 0.01771660649457), ('전쟁', 0.01769250384200995), ('마약범죄', 0.017074217372907808), ('경찰', 0.015992481921368002), ('마약왕', 0.015234999038888414), ('음료', 0.014510423614610218)]
Top 10 keywords in Group B (1 or more comments): [('전쟁', 0.0212565977159329), ('마약사범', 0.019878078055057705), ('수사', 0.01806327796821882), ('음료', 0.017546353894940758), ('범죄', 0.016550337022138394), ('경찰', 0.015130577531840132), ('투약', 0.014714449071572759), ('검거', 0.01431982179820058), ('마약음료', 0.012922448066324253), ('강남', 0.012424066527380168)]

Results show top 10 TF-IDF keywords from both groups. Similar keywords appear across groups (e.g., Gangnam drug drink incident), reflecting similar headlines during specific reporting periods.
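
The overlap between the two top-10 lists can be made explicit with a few list comprehensions. The keywords below are copied from the printed output above; the variable names are illustrative.

```python
# Top-10 TF-IDF keywords per group, copied from the output above.
top_a = ["범죄", "마약사범", "청소년", "예방", "수사", "전쟁", "마약범죄", "경찰", "마약왕", "음료"]
top_b = ["전쟁", "마약사범", "수사", "음료", "범죄", "경찰", "투약", "검거", "마약음료", "강남"]

shared = [word for word in top_a if word in top_b]      # in both lists
only_a = [word for word in top_a if word not in top_b]  # Group A only
only_b = [word for word in top_b if word not in top_a]  # Group B only

print("Shared:", shared)
print("Only in Group A:", only_a)
print("Only in Group B:", only_b)
```

Six of the ten keywords are shared. The four unique to Group A lean toward prevention and youth (청소년, 예방), while those unique to Group B point at specific incidents and enforcement (투약, 검거, 마약음료, 강남).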

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import numpy as np

# Merge the two dictionaries
merged_dictionary = Dictionary(documents=data_without['processed'].tolist() + data_with['processed'].tolist())

# Convert the original corpora to the merged dictionary
corpus_data_without_merged = [merged_dictionary.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with_merged = [merged_dictionary.doc2bow(doc) for doc in data_with['processed']]

# Update the TfidfModels
tfidf_data_without = TfidfModel(corpus_data_without_merged)
tfidf_data_with = TfidfModel(corpus_data_with_merged)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without_merged = tfidf_data_without[corpus_data_without_merged]
corpus_tfidf_data_with_merged = tfidf_data_with[corpus_data_with_merged]

# Calculate the average TF-IDF scores for each term in the given corpus
def calculate_avg_tfidf(tfidf_corpus, dictionary):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)

    return avg_tfidf

# Calculate average TF-IDF scores for each dataset
avg_tfidf_data_without = calculate_avg_tfidf(corpus_tfidf_data_without_merged, merged_dictionary)
avg_tfidf_data_with = calculate_avg_tfidf(corpus_tfidf_data_with_merged, merged_dictionary)

# Calculate the difference in average TF-IDF scores between the two datasets
tfidf_diff = avg_tfidf_data_with - avg_tfidf_data_without

# Sort the terms based on the difference in their average scores (higher in data_with)
sorted_indices = np.argsort(tfidf_diff)[::-1]

# Print the terms with their average TF-IDF scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

for i in sorted_indices[:40]:  # Display the top 40 terms
    term = merged_dictionary[i]
    group_a_score = avg_tfidf_data_without[i]
    group_b_score = avg_tfidf_data_with[i]
    diff = tfidf_diff[i]
    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")

Term Group A Group B Difference
--------------------------------------------------
한동훈 0.0011 0.0068 0.0057
강력부 0.0000 0.0054 0.0054
급증 0.0040 0.0092 0.0052
조직 0.0045 0.0097 0.0052
중국 0.0000 0.0048 0.0048
파티 0.0000 0.0048 0.0048
필로폰 0.0064 0.0110 0.0047
강남 0.0078 0.0124 0.0046
명분 0.0023 0.0067 0.0044
유혹 0.0012 0.0056 0.0044
특수 0.0027 0.0070 0.0043
지시 0.0018 0.0059 0.0041
사망 0.0015 0.0055 0.0039
손자 0.0014 0.0051 0.0037
대마 0.0028 0.0065 0.0036
전쟁 0.0177 0.0213 0.0036
특수본 0.0015 0.0051 0.0036
대검 0.0034 0.0069 0.0035
단속 0.0060 0.0095 0.0035
학생 0.0017 0.0052 0.0035
신종 0.0000 0.0034 0.0034
검출 0.0000 0.0033 0.0033
선포 0.0024 0.0057 0.0032
중학생 0.0000 0.0032 0.0032
전두환 0.0014 0.0045 0.0032
폭행 0.0000 0.0031 0.0031
전면전 0.0015 0.0046 0.0031
배달 0.0000 0.0031 0.0031
부산 0.0000 0.0031 0.0031
유통 0.0068 0.0099 0.0031
음료 0.0145 0.0175 0.0030
대치동 0.0000 0.0030 0.0030
대치 0.0000 0.0030 0.0030
공급 0.0016 0.0046 0.0030
커피 0.0000 0.0029 0.0029
옛말 0.0000 0.0029 0.0029
알바 0.0000 0.0029 0.0029
무더기 0.0028 0.0057 0.0028
어른 0.0000 0.0028 0.0028
검찰 0.0055 0.0083 0.0028

The TF-IDF score comparison identified terms weighted more heavily in Group B (articles with comments), including several that are exclusive to Group B (a score of 0.0000 in Group A).
The analysis revealed 5 key patterns in articles that drew comments:
1. Political/drug news (e.g., Han Dong-hoon, Chun Doo-hwan cases)
2. Specific drug names (methamphetamine, marijuana, new synthetic drugs)
3. Local incident details (Gangnam drink case, specific locations)
4. Distribution channels (delivery, supply, part-time workers)
5. Sensational language (surge, all-out war, mass arrests)
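
To separate terms that appear only in Group B from terms that are merely weighted higher there, the table rows can be filtered on the Group A score. The sketch below uses a handful of rows copied from the table above; `split_exclusive` is a hypothetical helper, and because the printed scores are rounded to four decimals, a value of 0.0000 is treated as "effectively absent" rather than as proof of zero occurrences.

```python
# (term, Group A score, Group B score), copied from the printed table above.
rows = [
    ("한동훈", 0.0011, 0.0068),
    ("강력부", 0.0000, 0.0054),
    ("급증", 0.0040, 0.0092),
    ("중국", 0.0000, 0.0048),
    ("파티", 0.0000, 0.0048),
    ("강남", 0.0078, 0.0124),
]


def split_exclusive(rows, eps=0.00005):
    """Split terms into those absent from Group A (score below eps) and the rest."""
    exclusive = [term for term, a, b in rows if a < eps <= b]
    shared = [term for term, a, b in rows if a >= eps]
    return exclusive, shared


exclusive_to_b, in_both = split_exclusive(rows)
print("Exclusive to Group B:", exclusive_to_b)
print("Present in both groups:", in_both)
```

Terms such as 한동훈 and 강남 appear in both groups and differ only in weight, so they support a weaker claim than the exclusive terms do.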

Next step: To avoid logical leaps that could occur if these patterns appear in both groups, we will now examine terms that are distinctively characteristic of Group A (articles without comments).

import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import numpy as np

# Merge the two dictionaries
merged_dictionary = Dictionary(documents=data_without['processed'].tolist() + data_with['processed'].tolist())

# Convert the original corpora to the merged dictionary
corpus_data_without_merged = [merged_dictionary.doc2bow(doc) for doc in data_without['processed']]
corpus_data_with_merged = [merged_dictionary.doc2bow(doc) for doc in data_with['processed']]

# Update the TfidfModels
tfidf_data_without = TfidfModel(corpus_data_without_merged)
tfidf_data_with = TfidfModel(corpus_data_with_merged)

# Convert the corpus to a TF-IDF representation
corpus_tfidf_data_without_merged = tfidf_data_without[corpus_data_without_merged]
corpus_tfidf_data_with_merged = tfidf_data_with[corpus_data_with_merged]

# Calculate the average TF-IDF scores for each term in the given corpus
def calculate_avg_tfidf(tfidf_corpus, dictionary):
    avg_tfidf = np.zeros(len(dictionary))

    for doc in tfidf_corpus:
        for term_id, tfidf_score in doc:
            avg_tfidf[term_id] += tfidf_score

    avg_tfidf /= len(tfidf_corpus)

    return avg_tfidf

# Calculate average TF-IDF scores for each dataset
avg_tfidf_data_without = calculate_avg_tfidf(corpus_tfidf_data_without_merged, merged_dictionary)
avg_tfidf_data_with = calculate_avg_tfidf(corpus_tfidf_data_with_merged, merged_dictionary)

# Calculate the difference in average TF-IDF scores between the two datasets
tfidf_diff = avg_tfidf_data_without - avg_tfidf_data_with 

# Sort the terms based on the difference in their average scores (higher in data_without)
sorted_indices = np.argsort(tfidf_diff)[::-1]

# Print the terms with their average TF-IDF scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

for i in sorted_indices[:40]:  # Display the top 40 terms
    term = merged_dictionary[i]
    group_a_score = avg_tfidf_data_without[i]
    group_b_score = avg_tfidf_data_with[i]
    diff = tfidf_diff[i]
    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")

Term Group A Group B Difference
--------------------------------------------------
예방 0.0193 0.0029 0.0164
청소년 0.0218 0.0098 0.0120
캠페인 0.0118 0.0010 0.0108
범죄 0.0251 0.0166 0.0086
교육 0.0104 0.0021 0.0083
마약범죄 0.0171 0.0092 0.0078
진통제 0.0090 0.0023 0.0067
근절 0.0093 0.0027 0.0066
재활 0.0116 0.0052 0.0064
릴레이 0.0063 0.0000 0.0063
도시 0.0060 0.0000 0.0060
외국인 0.0086 0.0025 0.0060
필요 0.0081 0.0021 0.0060
반입 0.0094 0.0039 0.0055
컨트롤타워 0.0080 0.0025 0.0055
컨트롤 0.0080 0.0025 0.0055
예방교육 0.0065 0.0011 0.0054
고교생 0.0085 0.0031 0.0054
타워 0.0080 0.0028 0.0052
지원 0.0068 0.0017 0.0051
관세청 0.0072 0.0022 0.0050
서울시 0.0054 0.0005 0.0049
환자 0.0053 0.0004 0.0049
마약중독 0.0080 0.0032 0.0048
발의 0.0051 0.0005 0.0046
경찰서 0.0046 0.0000 0.0046
증가 0.0084 0.0038 0.0046
마약성 0.0064 0.0020 0.0043
학교 0.0058 0.0014 0.0043
마약왕 0.0152 0.0109 0.0043
실시 0.0048 0.0004 0.0043
히틀러 0.0055 0.0012 0.0043
불법 0.0082 0.0041 0.0042
임계점 0.0050 0.0008 0.0042
강남구 0.0042 0.0000 0.0042
교육청 0.0049 0.0008 0.0041
중독자 0.0068 0.0027 0.0041
조례 0.0041 0.0000 0.0041
수도 0.0046 0.0005 0.0041
치료시설 0.0041 0.0000 0.0041

We searched Group A for counterexamples to Group B's five characteristics. The term ‘강남구(Gangnam-gu)’ could potentially counter Group B’s third characteristic (specific incidents), since it may refer to the Gangnam drink incident, but its absence from Group B makes it unsuitable for comparison. No other counterexamples were found.

Next step: Having completed the TF-IDF analysis, we explore keyword characteristics using Word2vec for additional insights.
Word2vec is a neural network technique that converts words to vectors based on context, capturing semantic relationships between words that appear in similar contexts.
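
Similarity between word vectors is commonly measured with cosine similarity, which normalizes by vector length. The sketch below contrasts it with a raw dot product, which is what the scoring cell in this section computes; the vectors are toy values and the function names are illustrative.

```python
import math


def dot(u, v):
    """Raw dot product: grows with vector length as well as with alignment."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by both norms; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


a = [1.0, 2.0]
b = [2.0, 4.0]   # same direction as a, twice the length
c = [-2.0, 1.0]  # perpendicular to a

print(dot(a, b), cosine_similarity(a, b))
print(dot(a, c), cosine_similarity(a, c))
```

Because a raw dot product depends on vector magnitude, scores from it are harder to compare across keywords than cosine values, which always fall between -1 and 1.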

import pandas as pd
import numpy as np
from gensim.models import Word2Vec

# Combine the preprocessed text from both groups
combined_text = data_without['processed'].tolist() + data_with['processed'].tolist()

# Train a Word2Vec model using both datasets
model = Word2Vec(sentences=combined_text, vector_size=100, window=5, min_count=1, workers=4)

# Calculate the average word embeddings for each article in both groups
def average_word_embeddings(text, word2vec_model):
    embeddings = []
    for word in text:
        if word in word2vec_model.wv:
            embeddings.append(word2vec_model.wv[word])
    if embeddings:
        return np.mean(embeddings, axis=0)
    else:
        return np.zeros((word2vec_model.vector_size,))

data_without['avg_word_embeddings'] = data_without['processed'].apply(average_word_embeddings, word2vec_model=model)
data_with['avg_word_embeddings'] = data_with['processed'].apply(average_word_embeddings, word2vec_model=model)

# Calculate the average word embeddings for both groups
group_a_avg = np.mean(np.vstack(data_without['avg_word_embeddings']), axis=0)
group_b_avg = np.mean(np.vstack(data_with['avg_word_embeddings']), axis=0)

# Get all unique keywords from both groups
keywords = set()
for text in combined_text:
    keywords.update(text)

keywords = list(keywords)
keyword_vectors = np.vstack([model.wv[keyword] for keyword in keywords])

# Score each keyword against the average embedding of each group (raw dot product, not normalized cosine similarity)
keyword_scores_a = np.dot(keyword_vectors, group_a_avg)
keyword_scores_b = np.dot(keyword_vectors, group_b_avg)

# Calculate the difference in similarity scores between the two groups for each keyword
keyword_diffs = keyword_scores_b - keyword_scores_a

# Sort the keywords based on the difference in their embeddings (higher in data_with)
sorted_indices = np.argsort(keyword_diffs)[::-1]

# Print the keywords with their similarity scores in both groups and their difference
print(f"{'Term':<40}{'Group A':<10}{'Group B':<10}{'Difference':<10}")
print("-" * 50)

count = 0
for i in sorted_indices:
    if count >= 40:  # Display the top 40 terms
        break

    term = keywords[i]
    group_a_score = keyword_scores_a[i]
    group_b_score = keyword_scores_b[i]
    diff = keyword_diffs[i]

    print(f"{term:<40}{group_a_score:<10.4f}{group_b_score:<10.4f}{diff:<10.4f}")
    count += 1

Term Group A Group B Difference
--------------------------------------------------
압수물 -0.0041 -0.0040 0.0001
절대 -0.0026 -0.0025 0.0000
이주 -0.0020 -0.0020 0.0000
도주 -0.0023 -0.0023 0.0000
시위 -0.0007 -0.0007 0.0000
신전 -0.0024 -0.0023 0.0000
가방 -0.0020 -0.0020 0.0000
즉각 -0.0012 -0.0011 0.0000
평창경찰 -0.0028 -0.0028 0.0000
다크웹으 -0.0016 -0.0015 0.0000
버닝 -0.0011 -0.0010 0.0000
만명 -0.0017 -0.0017 0.0000
동해 -0.0028 -0.0027 0.0000
패소 -0.0023 -0.0023 0.0000
연결 -0.0013 -0.0012 0.0000
전방위 -0.0015 -0.0014 0.0000
다이어트약 -0.0012 -0.0012 0.0000
휴일 -0.0014 -0.0014 0.0000
페스티벌 -0.0016 -0.0016 0.0000
방해 -0.0019 -0.0019 0.0000
반성 -0.0010 -0.0009 0.0000
경필 -0.0016 -0.0015 0.0000
방책 -0.0019 -0.0019 0.0000
포퓰리즘 -0.0022 -0.0022 0.0000
관물대 -0.0024 -0.0024 0.0000
채팅 -0.0012 -0.0012 0.0000
중국발 -0.0003 -0.0003 0.0000
군수 -0.0002 -0.0002 0.0000
위력순찰 -0.0009 -0.0009 0.0000
우주 -0.0013 -0.0013 0.0000
불안하다 -0.0014 -0.0014 0.0000
전쟁살인 -0.0014 -0.0014 0.0000
나르키소스 -0.0019 -0.0019 0.0000
포르 -0.0010 -0.0010 0.0000
계파 -0.0009 -0.0009 0.0000
마조부 -0.0002 -0.0002 0.0000
주택 -0.0009 -0.0009 0.0000
충청권 -0.0012 -0.0012 0.0000
원석 -0.0017 -0.0017 0.0000
술잔 -0.0004 -0.0003 0.0000

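Word2vec captures semantic similarity from the contexts in which words appear, and on that measure the two groups are nearly indistinguishable: the largest difference in favor of Group B is 0.0001, so no significant differences were observed.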
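
One possible reading of the near-zero differences, offered as an interpretation rather than a finding of the original analysis: the difference between a keyword's two scores equals its dot product with the gap between the two group averages. If the group averages are almost the same vector, every keyword's difference is small regardless of the keyword. The toy vectors below are assumptions chosen to show the identity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy values: two group averages that differ only slightly.
keyword = [0.5, -1.0, 2.0]
group_a_avg = [0.10, 0.20, 0.30]
group_b_avg = [0.11, 0.20, 0.29]

# Difference of the two scores, as computed in the notebook cell above.
diff_of_scores = dot(keyword, group_b_avg) - dot(keyword, group_a_avg)

# The same quantity, written as one dot product with the gap between averages.
gap = [b - a for a, b in zip(group_a_avg, group_b_avg)]
score_of_gap = dot(keyword, gap)

print(diff_of_scores, score_of_gap)
```

Both expressions give the same value up to floating-point error, so the size of the keyword differences is bounded by how far apart the two group averages are.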

Next step: Frequency analysis and word cloud visualization

import re
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Set a Korean font so Hangul labels render in plots
font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
rc('font', family=font_name)

# # Create a function to clean tokens
# def clean_token(token):
#     return token.replace("[", "").replace("]", "").replace("'", "").strip()

def word_graph(ax, cnt, group_name, max_words=10):
    sorted_w = sorted(cnt.items(), key=lambda kv: kv[1])
    print(f"{group_name}:\n", sorted_w[-max_words:])
    w, n = zip(*sorted_w[-max_words:])
    ax.barh(range(len(w)), n, tick_label=w)
    ax.set_title(group_name)
    
# Text cleaning function: removes every character that is not Korean.
def clean_token(text):
    # Use a regular expression to keep only Hangul characters (and spaces).
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', text)
    return result

def split_text(text):
    tokens = text.split()
    return tokens

# Apply the text cleaning and check the processed text.
data_without['title_cleaned'] = data_without['title_tokenized'].apply(text_cleaning)
data_with['title_cleaned'] = data_with['title_tokenized'].apply(text_cleaning)

data_without['processed'] = data_without['title_cleaned'].apply(split_text)
data_with['processed'] = data_with['title_cleaned'].apply(split_text)

fig, axes = plt.subplots(2, 1, figsize=(10, 12))

# Group A
data_A = pd.read_csv('data_without_comments4.csv')
data_A['title_cleaned'] = data_A['title_tokenized'].apply(text_cleaning)
data_A['processed'] = data_A['title_cleaned'].apply(split_text)

content_tokens_A = data_A['processed']

tokens_cnt_A = {}
for tokens in content_tokens_A:
    for token in tokens:
        cleaned_token = clean_token(token)
        if cleaned_token:
            tokens_cnt_A[cleaned_token] = tokens_cnt_A.get(cleaned_token, 0) + 1

word_graph(axes[0], tokens_cnt_A, "Group A", max_words=20)

# Group B
data_B = pd.read_csv('data_with_comments4.csv')
data_B['title_cleaned'] = data_B['title_tokenized'].apply(text_cleaning)
data_B['processed'] = data_B['title_cleaned'].apply(split_text)

content_tokens_B = data_B['processed']

tokens_cnt_B = {}
for tokens in content_tokens_B:
    for token in tokens:
        cleaned_token = clean_token(token)
        if cleaned_token:
            tokens_cnt_B[cleaned_token] = tokens_cnt_B.get(cleaned_token, 0) + 1

word_graph(axes[1], tokens_cnt_B, "Group B", max_words=20)

plt.tight_layout()
plt.show()

Punishment or prevention?: A machine learning analysis of drug-related news coverage across cultures
