diff --git a/Day_002_Core_Data_Processing.md b/Day_002_Core_Data_Processing.md
index f64e48f..c3e2841 100644
--- a/Day_002_Core_Data_Processing.md
+++ b/Day_002_Core_Data_Processing.md
@@ -1,63 +1,75 @@
+## **Day 2 - Core Data Processing with Python**
+**File:** `Day2_Core_Data_Processing.md`
+
+```markdown
 # 📅 Day 2 - Core Data Processing with Python
 
 ## 🎯 Goal
-Transform the raw data into structured, insightful information using Python’s analytical power.
+Transform raw data into structured, insightful information using Python.
 
 ---
 
 ## 🧩 Tasks
 
-### 1. Integrate Python into n8n
-- Add an **Execute Code** node after the Merge node (from Day 1).
-- This node receives combined JSON data from sales and reviews.
-
-----
-
-### 2. Write the Python Script (Data Cleaning & Aggregation)
-- Use **Pandas** for structured data manipulation.
-- Inside the Execute Code node:
-  - Load the JSON input into two DataFrames:
-    ```python
-    import pandas as pd
-
-    sales_df = pd.DataFrame($json["sales"])
-    reviews_df = pd.DataFrame($json["reviews"])
-    ```
-  - Clean the data:
-    - Handle missing values.
-    - Convert data types.
-    - Remove duplicates.
-  - Aggregate sales data:
-    ```python
-    sales_summary = sales_df.groupby("product_id").agg(
-        total_revenue=("price", "sum"),
-        units_sold=("quantity", "sum")
-    ).reset_index()
-    ```
+### 1. Integrate Python in n8n
+- Add an **Execute Code** node after the Merge node (from Day 1).
+- Receive combined JSON data from sales and reviews.
 
 ---
 
-### 3. Add Sentiment Analysis
-- Use **VADER** from the `nltk` library for text sentiment scoring.
+### 2. Data Cleaning & Aggregation
+- Load JSON data into two Pandas DataFrames:
   ```python
-  from nltk.sentiment.vader import SentimentIntensityAnalyzer
-  sid = SentimentIntensityAnalyzer()
+  import pandas as pd
 
-  reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
-      lambda text: sid.polarity_scores(text)["compound"]
-  )
+  sales_df = pd.DataFrame($json["sales"])
+  reviews_df = pd.DataFrame($json["reviews"])
+Clean data: handle missing values, convert data types, remove duplicates.
+Aggregate sales data:
+python
+Copy code
+sales_summary = sales_df.groupby("product_id").agg(
+    total_revenue=("price", "sum"),
+    units_sold=("quantity", "sum")
+).reset_index()
+3. Sentiment Analysis
+Use NLTK VADER for review sentiment:
 
-### Aggregate sentiment data:
+python
+Copy code
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+sid = SentimentIntensityAnalyzer()
+reviews_df["sentiment_score"] = reviews_df["review_text"].apply(
+    lambda text: sid.polarity_scores(text)["compound"]
+)
+Categorize sentiment: Positive, Neutral, Negative.
+
+Aggregate sentiment per product:
+
+python
+Copy code
 sentiment_summary = reviews_df.groupby("product_id").agg(
     avg_sentiment_score=("sentiment_score", "mean"),
     num_reviews=("review_text", "count")
 ).reset_index()
+4. Combine & Output
+Merge aggregated sales and sentiment:
 
-
-### Merge with sales data:
-
+python
+Copy code
 final_df = pd.merge(sales_summary, sentiment_summary, on="product_id", how="left")
-return json.loads(final_df.to_json(orient="records"))
\ No newline at end of file
+return json.loads(final_df.to_json(orient="records"))
+✅ Deliverable
+Python node outputs a clean JSON object containing:
+
+Aggregated sales data
+
+Sentiment scores per product
+
+💡 Solution
+Combined DataFrame ready for storage and reporting in Day 3.
+
+All data is clean, structured, and enriched.
\ No newline at end of file
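
Review note: the added line "Categorize sentiment: Positive, Neutral, Negative" has no code behind it in this change. Below is a minimal sketch of one way to do it, not the author's method. The function name `categorize_sentiment` is hypothetical. The ±0.05 cutoff on the VADER compound score is the commonly used convention and is assumed here, so tune it for this dataset.

```python
def categorize_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score (range -1.0 to 1.0) to a coarse label.

    The 0.05 default is an assumed, conventional cutoff; it is not taken
    from the change under review.
    """
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Illustrative scores only (made up for the example, not project data).
labels = [categorize_sentiment(score) for score in (0.62, 0.0, -0.41)]
print(labels)
```

With the DataFrame from the diff, this could be applied as `reviews_df["sentiment_score"].apply(categorize_sentiment)` and stored in a new column (for example a hypothetical `sentiment_label`) before the per-product aggregation.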