Day 2 — Core Data Processing with Python_Updated
@@ -1,63 +1,75 @@
**File:** `Day2_Core_Data_Processing.md`

# 📅 Day 2 — Core Data Processing with Python

## 🎯 Goal
Transform the raw sales and review data into structured, per-product information using Python.

---

## 🧩 Tasks

### 1. Integrate Python in n8n
- Add an **Execute Code** node after the Merge node (from Day 1).
- This node receives the combined JSON data from sales and reviews.

---

### 2. Data Cleaning & Aggregation
- Load the JSON data into two Pandas DataFrames:

```python
import pandas as pd

# "$json" is not valid Python; this assumes n8n's Python code node,
# where the current item is exposed as "_json"
sales_df = pd.DataFrame(_json["sales"])
reviews_df = pd.DataFrame(_json["reviews"])
```

Clean data: handle missing values, convert data types, and remove duplicates.

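The cleaning steps above could be sketched as follows. This is a minimal sketch rather than a reference solution: the `clean_sales` helper and the sample rows are hypothetical, and it assumes the sales records carry the `product_id`, `price`, and `quantity` columns used in the aggregation step. Treating a missing quantity as one unit is also an assumption.

```python
import pandas as pd


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: remove duplicates, convert types, handle missing values."""
    df = df.drop_duplicates()
    # Convert numeric columns; values that cannot be parsed become NaN
    df = df.assign(
        price=pd.to_numeric(df["price"], errors="coerce"),
        quantity=pd.to_numeric(df["quantity"], errors="coerce"),
    )
    # Rows without a product or a price cannot be aggregated, so drop them
    df = df.dropna(subset=["product_id", "price"])
    # Assumption: a missing quantity means a single unit was sold
    df = df.assign(quantity=df["quantity"].fillna(1).astype(int))
    return df.reset_index(drop=True)


# Made-up sample rows, for illustration only
raw = pd.DataFrame([
    {"product_id": "A", "price": "10.5", "quantity": 2},
    {"product_id": "A", "price": "10.5", "quantity": 2},  # exact duplicate
    {"product_id": "B", "price": None, "quantity": 1},    # missing price
    {"product_id": "C", "price": "4", "quantity": None},  # missing quantity
])
cleaned = clean_sales(raw)
```

In the workflow this would be applied as `sales_df = clean_sales(sales_df)` before aggregating; the review data would need its own rules, for example dropping rows with an empty `review_text`.
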
Aggregate sales data:

```python
# Summing "price" assumes each row's price is a line total, not a unit price
sales_summary = sales_df.groupby("product_id").agg(
    total_revenue=("price", "sum"),
    units_sold=("quantity", "sum")
).reset_index()
```

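Whether summing `price` yields revenue depends on what the column holds, which the source does not specify. If `price` is a per-unit price, a sketch like the following computes revenue as price times quantity before grouping. The sample rows and the `line_total` column name are hypothetical.

```python
import pandas as pd

# Made-up sales rows, for illustration only
sales_df = pd.DataFrame({
    "product_id": ["A", "A", "B"],
    "price": [10.0, 10.0, 4.0],   # assumed to be a unit price here
    "quantity": [2, 1, 3],
})

sales_summary = (
    sales_df.assign(line_total=sales_df["price"] * sales_df["quantity"])
    .groupby("product_id")
    .agg(total_revenue=("line_total", "sum"), units_sold=("quantity", "sum"))
    .reset_index()
)
```
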
### 3. Sentiment Analysis
Use NLTK VADER for review sentiment:

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires the VADER lexicon, e.g. a one-time nltk.download("vader_lexicon")
sid = SentimentIntensityAnalyzer()

reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
    lambda text: sid.polarity_scores(text)["compound"]
)
```

Categorize each review's sentiment as Positive, Neutral, or Negative based on its compound score.

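One way to do the categorization is to bucket the compound score with fixed cut-offs. The sketch below uses ±0.05, the thresholds commonly suggested for VADER's compound score; the `categorize_sentiment` helper name is an assumption, and the cut-offs can be tuned to the data.

```python
def categorize_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Hypothetical helper: map a VADER compound score in [-1, 1] to a label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Made-up scores, for illustration only
labels = [categorize_sentiment(score) for score in (0.82, 0.0, -0.4)]
```

In the workflow this could be applied as `reviews_df["sentiment_label"] = reviews_df["sentiment_score"].apply(categorize_sentiment)`, where `sentiment_label` is a hypothetical column name.
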
Aggregate sentiment per product:

```python
sentiment_summary = reviews_df.groupby("product_id").agg(
    avg_sentiment_score=("sentiment_score", "mean"),
    num_reviews=("review_text", "count")
).reset_index()
```

### 4. Combine & Output
Merge aggregated sales and sentiment:

```python
import json  # needed for json.loads below

final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")
return json.loads(final_df.to_json(orient="records"))
```

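To illustrate what the left merge does with a product that has sales but no reviews, here is a small self-contained sketch. The product IDs and values are made up. The unmatched product keeps its sales figures, and its sentiment fields are serialized as `null`.

```python
import json

import pandas as pd

# Made-up summaries, for illustration only
sales_summary = pd.DataFrame({
    "product_id": ["A", "B"],
    "total_revenue": [21.0, 4.0],
    "units_sold": [4, 1],
})
sentiment_summary = pd.DataFrame({
    "product_id": ["A"],          # product "B" has no reviews
    "avg_sentiment_score": [0.5],
    "num_reviews": [2],
})

# how="left" keeps every row of sales_summary, matched or not
final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")
records = json.loads(final_df.to_json(orient="records"))
```

If a review count of 0 is preferable to `null` downstream, filling `num_reviews` with `fillna(0)` before serializing is one option.
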
## ✅ Deliverable
The Python node outputs clean JSON, one record per product, containing:
- Aggregated sales data
- Sentiment scores per product

## 💡 Solution
- Combined DataFrame ready for storage and reporting in Day 3.
- All data is clean, structured, and enriched.