Update Day_002_Core_Data_Processing.md
# 📅 Day 2 — Core Data Processing with Python

## 🎯 Goal

Transform the raw sales and review data into structured, per-product metrics using Python.

---

## 🧩 Tasks

### 1. Integrate Python into n8n

- Add an **Execute Code** node after the Merge node (from Day 1).
- This node receives the combined JSON data from sales and reviews.
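
For orientation, the snippets in this guide read `sales` and `reviews` keys from the incoming item. A minimal sketch of the shape that implies is shown here. Only the key and column names come from this guide; every value is invented for illustration, and the real Day 1 payload may carry extra fields or a different nesting.

```python
import pandas as pd

# Hypothetical payload: the keys and column names mirror this guide's snippets,
# but all values are made up for illustration.
example_item = {
    "sales": [
        {"product_id": "P1", "price": 20.0, "quantity": 2},
        {"product_id": "P2", "price": 5.5, "quantity": 1},
    ],
    "reviews": [
        {"product_id": "P1", "review_text": "Works well."},
    ],
}

# A list of records maps directly onto DataFrame rows
sales_df = pd.DataFrame(example_item["sales"])
reviews_df = pd.DataFrame(example_item["reviews"])

print(sales_df.columns.tolist())    # column names follow the record keys
print(reviews_df.columns.tolist())
```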

---

### 2. Write the Python Script (Data Cleaning & Aggregation)

- Use **Pandas** for structured data manipulation.
- Inside the Execute Code node:
  - Load the JSON input into two DataFrames:
    ```python
    import pandas as pd

    # In n8n's Python mode the built-ins use an underscore prefix (_json);
    # $json is the JavaScript form and is not valid Python.
    # If the values arrive as Pyodide proxies, convert them with .to_py() first.
    sales_df = pd.DataFrame(_json["sales"])
    reviews_df = pd.DataFrame(_json["reviews"])
    ```
  - Clean the data:
    - Handle missing values.
    - Convert data types.
    - Remove duplicates.
  - Aggregate sales data:
    ```python
    # Note: summing "price" assumes each row's price is already the line total;
    # if it is a unit price, multiply by quantity before summing.
    sales_summary = sales_df.groupby("product_id").agg(
        total_revenue=("price", "sum"),
        units_sold=("quantity", "sum"),
    ).reset_index()
    ```
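
The three cleaning bullets above (missing values, data types, duplicates) can be sketched as follows. This is one possible policy under stated assumptions, not the only reasonable one: the helper name `clean_sales` is hypothetical, the toy rows are invented, and dropping incomplete rows is a choice (imputing would be another). Note that dropping exact duplicates is only safe if identical rows cannot be legitimate repeat sales, for example when each row carries an order ID.

```python
import pandas as pd

# Toy sales rows with the problems the cleaning step targets (values invented)
raw_sales = pd.DataFrame(
    {
        "product_id": ["P1", "P1", "P2", "P3"],
        "price": ["20.0", "20.0", "5.5", None],
        "quantity": ["2", "2", "1", "4"],
    }
)


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one possible cleaning policy."""
    out = df.copy()
    # Convert data types; unparseable values become NaN instead of raising
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["quantity"] = pd.to_numeric(out["quantity"], errors="coerce")
    # Handle missing values: here, rows without a price or quantity are dropped
    out = out.dropna(subset=["price", "quantity"])
    # Remove exact duplicate rows and renumber the index
    out = out.drop_duplicates().reset_index(drop=True)
    return out


clean = clean_sales(raw_sales)
print(clean)
```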

---

### 3. Add Sentiment Analysis

- Use **VADER** from the `nltk` library for text sentiment scoring.

```python
import json

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; download it once if it is not already installed
nltk.download("vader_lexicon", quiet=True)
sid = SentimentIntensityAnalyzer()

# Score each review with VADER's compound score (range -1 to 1)
reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
    lambda text: sid.polarity_scores(text)["compound"]
)

# Aggregate sentiment data:
sentiment_summary = reviews_df.groupby("product_id").agg(
    avg_sentiment_score=("sentiment_score", "mean"),
    num_reviews=("review_text", "count"),
).reset_index()

# Merge with sales data:
final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")

return json.loads(final_df.to_json(orient="records"))
```
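
Because the merge uses `how="left"`, a product with sales but no reviews keeps its row, with the sentiment columns left missing. The sketch below shows that behaviour on invented toy summaries. Filling `num_reviews` with zero is an optional policy assumed here for illustration; it is not part of the script above.

```python
import json

import pandas as pd

# Toy summaries (values invented); P2 has sales but no reviews
sales_summary = pd.DataFrame(
    {"product_id": ["P1", "P2"], "total_revenue": [40.0, 5.5], "units_sold": [4, 1]}
)
sentiment_summary = pd.DataFrame(
    {"product_id": ["P1"], "avg_sentiment_score": [0.6], "num_reviews": [3]}
)

# The left merge keeps every sales row, matched or not
final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")

# Optional policy (an assumption): report zero reviews explicitly,
# but leave the average sentiment as missing rather than inventing a score
final_df["num_reviews"] = final_df["num_reviews"].fillna(0).astype(int)

# to_json writes NaN as null, so the unreviewed product serialises cleanly
records = json.loads(final_df.to_json(orient="records"))
print(records)
```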