Day 2 — Core Data Processing with Python_Updated

2025-10-09 06:49:33 +00:00
parent 90435d8ab7
commit f8287a147f
1 changed file with 52 additions and 40 deletions
@@ -1,63 +1,75 @@
+## **Day 2 — Core Data Processing with Python**  
+**File:** `Day2_Core_Data_Processing.md`
+
 # 📅 Day 2 — Core Data Processing with Python

 ## 🎯 Goal
-Transform the raw data into structured, insightful information using Python’s analytical power.
+Transform raw data into structured, insightful information using Python.

 ---

 ## 🧩 Tasks

-### 1. Integrate Python into n8n
+### 1. Integrate Python in n8n
 - Add an **Execute Code** node after the Merge node (from Day 1).  
- This node receives combined JSON data from sales and reviews.
+- Receive combined JSON data from sales and reviews.

----
+---

-### 2. Write the Python Script (Data Cleaning & Aggregation)
- Use **Pandas** for structured data manipulation.
- Inside the Execute Code node:
-  - Load the JSON input into two DataFrames:
+### 2. Data Cleaning & Aggregation
+- Load JSON data into two Pandas DataFrames:
  ```python
  import pandas as pd

  sales_df = pd.DataFrame($json["sales"])
  reviews_df = pd.DataFrame($json["reviews"])
-    ```
-  - Clean the data:
-    - Handle missing values.
-    - Convert data types.
-    - Remove duplicates.
-  - Aggregate sales data:
-    ```python
-    sales_summary = sales_df.groupby("product_id").agg(
+  ```
+- Clean the data: handle missing values, convert data types, and remove duplicates.
+- Aggregate sales data:
+
+```python
+sales_summary = sales_df.groupby("product_id").agg(
    total_revenue=("price", "sum"),
    units_sold=("quantity", "sum")
-    ).reset_index()
-    ```
+).reset_index()
+```
+
+---
+
+### 3. Sentiment Analysis
+- Use NLTK's VADER analyzer to score review sentiment:

---
+```python
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+sid = SentimentIntensityAnalyzer()

-### 3. Add Sentiment Analysis
- Use **VADER** from the `nltk` library for text sentiment scoring.
-  ```python
-  from nltk.sentiment.vader import SentimentIntensityAnalyzer
-  sid = SentimentIntensityAnalyzer()
-
-  reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
+reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
    lambda text: sid.polarity_scores(text)["compound"]
-  )
+)
+```
+
+- Categorize each score as Positive, Neutral, or Negative.

+- Aggregate sentiment per product:

-
-### Aggregate sentiment data:
-
+```python
 sentiment_summary = reviews_df.groupby("product_id").agg(
    avg_sentiment_score=("sentiment_score", "mean"),
    num_reviews=("review_text", "count")
 ).reset_index()
+```
+
+---
+
+### 4. Combine & Output
+- Merge the aggregated sales and sentiment data:

-
-### Merge with sales data:
-
+```python
 final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")
 return json.loads(final_df.to_json(orient="records"))
+```
+
+---
+
+## ✅ Deliverable
+The Python node outputs a clean JSON object containing:
+- Aggregated sales data
+- Sentiment scores per product
+
+---
+
+## 💡 Solution
+- The combined DataFrame is ready for storage and reporting in Day 3.
+- All data is clean, structured, and enriched.
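
Review note: the diff names two steps without showing code for them (cleaning the data, and categorizing sentiment as Positive, Neutral, or Negative), and the final `return` calls `json.loads` although `json` is never imported. The block below is a minimal sketch of those pieces, not the author's implementation. The helper names `clean_sales`, `categorize`, and `to_records` are hypothetical, the sample rows are made up for illustration, and the ±0.05 cutoff is the threshold conventionally suggested for VADER compound scores rather than anything stated in this commit. It also assumes that a sales row missing `product_id`, `price`, or `quantity` can simply be dropped.

```python
import json

import pandas as pd


def clean_sales(sales_df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, coerce numeric columns, and discard unusable rows."""
    df = sales_df.drop_duplicates().copy()
    # Coerce to numbers; values that cannot be parsed become NaN.
    for col in ("price", "quantity"):
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Assumption: a row without product_id, price, or quantity is unusable.
    return df.dropna(subset=["product_id", "price", "quantity"])


def categorize(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def to_records(df: pd.DataFrame) -> list:
    """Serialize a DataFrame to JSON-safe records (NaN becomes None)."""
    return json.loads(df.to_json(orient="records"))


# Illustrative data only; not taken from the workflow.
raw_sales = pd.DataFrame(
    [
        {"product_id": "A", "price": "10.5", "quantity": 2},
        {"product_id": "A", "price": "10.5", "quantity": 2},  # exact duplicate
        {"product_id": "B", "price": "n/a", "quantity": 1},   # unparseable price
        {"product_id": "C", "price": "4", "quantity": 3},
    ]
)
cleaned = clean_sales(raw_sales)
records = to_records(cleaned)
labels = [categorize(score) for score in (0.6, 0.0, -0.4)]
```

In the workflow itself, the label could be attached with something like `reviews_df["sentiment_label"] = reviews_df["sentiment_score"].apply(categorize)`, where `sentiment_label` is again a hypothetical column name. Two further points are worth checking before this runs in n8n: `$json` is not a valid Python identifier (n8n's Python mode is documented as exposing underscore-prefixed names such as `_json`, which should be verified against the installed version), and VADER needs its lexicon available, typically via `nltk.download("vader_lexicon")`.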