Chat Screenshot Intelligence

Research initiative at Samsung transforming chat screenshots into actionable sentiment intelligence using OCR, speaker diarization, and fine-tuned RoBERTa transformers.

Role

Research Intern

Duration

6 months

Team Size

4 people

Technologies Used

BERT, RoBERTa, OpenCV, EasyOCR, Python, NLP, Transformers

Project Overview

What happens when customer complaints arrive as screenshots instead of text? At Samsung PRISM Research, I tackled this fascinating problem — building a pipeline that extracts emotional intelligence from chat screenshots with 0.92+ F1-score across WhatsApp, Slack, and Samsung Messages.

This wasn't just OCR → Sentiment. We solved the hard problems: speaker attribution from bubble positioning, sarcasm detection in chat slang, and privacy-first PII redaction — all production-grade.

The Research Challenge

Traditional sentiment analysis works on text. But in the real world:

  • Customer grievances are shared as screenshots on social media
  • Support QA teams review images of conversations, not transcripts
  • Competitive intelligence involves analyzing competitor app screenshots

We needed to bridge the gap between unstructured visual data and actionable emotional insights.

Key Innovations

Three-Stage Pipeline Architecture

Stage 1: Image Pre-processing & ROI Detection

  • Bilateral filtering for noise reduction while preserving text edges
  • Dynamic contrast enhancement (CLAHE) for low-contrast mobile UIs
  • Contour-based ROI extraction isolating chat content from UI chrome
  • 94.3% accuracy on 500-image test set across diverse apps
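
A minimal sketch of this stage with OpenCV, assuming a single dominant chat region; the filter and threshold values are illustrative defaults, not the tuned production settings:

import cv2

def preprocess_screenshot(path):
    """Stage 1 sketch: denoise, boost contrast, isolate the chat ROI."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    # Bilateral filter smooths noise while preserving text edges
    smooth = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
    # CLAHE lifts contrast on dim mobile UIs without blowing out highlights
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(smooth)
    # Contour pass: keep the largest region (chat area), dropping UI chrome
    _, binary = cv2.threshold(enhanced, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return enhanced                 # fall back to the full frame
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return enhanced[y:y + h, x:x + w]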

Stage 2: OCR with Speaker Context Preservation

  • EasyOCR selected over Tesseract (96.8% vs 89.3% confidence score)
  • Novel speaker attribution algorithm using spatial clustering
  • Left/right bubble positioning → User/Respondent classification
  • 97.1% accuracy in speaker attribution (validated on 200 screenshots)
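
A sketch of this stage with EasyOCR, keeping each fragment's geometry so attribution can run downstream; the dictionary layout is this write-up's assumption, not the internal data model:

import easyocr

reader = easyocr.Reader(['en'])  # loads detection + recognition models once

def ocr_with_geometry(image):
    """Stage 2 sketch: keep each fragment's position, not just its text."""
    boxes = []
    for bbox, text, conf in reader.readtext(image):
        xs = [pt[0] for pt in bbox]
        ys = [pt[1] for pt in bbox]
        boxes.append({
            "text": text,
            "conf": conf,
            "x_median": sum(xs) / len(xs),  # horizontal centre of the fragment
            "y_top": min(ys),               # used for top-to-bottom ordering
        })
    return boxes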

Stage 3: Transformer-Based Sentiment Classification

  • Fine-tuned RoBERTa over BERT for superior sarcasm handling
  • Custom chat-specific preprocessing (emoji → text, slang normalization)
  • Speaker-level sentiment tracking with temporal trajectory analysis
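
A hedged inference sketch with HuggingFace Transformers; "samsung-prism/chat-roberta" is a hypothetical placeholder for the internal fine-tuned checkpoint, and any three-class sentiment RoBERTa fits the same shape:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "samsung-prism/chat-roberta"   # placeholder, not a public checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
LABELS = ["negative", "neutral", "positive"]

def classify_turn(text):
    """Stage 3 sketch: one sentiment label + confidence per speaker turn."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    return LABELS[int(probs.argmax())], float(probs.max())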

Model Performance

Class      Precision   Recall   F1-Score
Positive   0.912       0.935    0.923
Neutral    0.934       0.896    0.914
Negative   0.918       0.950    0.934
Overall    -           -        0.923

Privacy-First PII Redaction

Built hybrid NER + rule-based PII detection achieving 95.5% F1-score:

  • Named entity recognition for PERSON, PHONE, EMAIL, ADDRESS
  • Automatic redaction before cloud processing
  • GDPR/CCPA compliance built-in
  • Complete audit trails for regulatory evidence
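
A minimal sketch of the hybrid detector, assuming spaCy's off-the-shelf en_core_web_sm model; the regexes are illustrative and simpler than the production rules:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Rule-based patterns catch structured PII that NER tends to miss
PATTERNS = {
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def redact_pii(text):
    """Regex first, then NER, before anything leaves for cloud processing."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    doc = nlp(text)
    for ent in reversed(doc.ents):      # reversed so char offsets stay valid
        if ent.label_ in {"PERSON", "GPE", "LOC", "FAC"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text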

Technical Deep Dive

Why RoBERTa Over BERT?

RoBERTa Advantages:
✓ No Next Sentence Prediction (removes noise)
✓ Dynamic masking (a fresh mask pattern each epoch instead of static pre-masked copies)
✓ Larger batch sizes (8K sequences, trained across 1024 GPUs)
✓ 2-3% F1 improvement on sentiment tasks
✓ Better sarcasm capture — critical for chat analysis

Speaker Context Preservation Algorithm

# Novel contribution: Speaker attribution without explicit bubble detection
1. Spatial Clustering → Group OCR boxes by vertical proximity
2. Horizontal Position Analysis:
   - x_median < image_width/3 → User speaker
   - x_median > 2*image_width/3 → Respondent speaker
3. Temporal Sequencing → Order top-to-bottom
4. Insert <SPEAKER_CHANGE> tokens at conversation turns
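
A runnable sketch of that outline, consuming the geometry-tagged boxes from the Stage 2 sketch above; line_gap is an illustrative clustering threshold, not the tuned value:

def attribute_speakers(boxes, image_width, line_gap=18):
    """Cluster OCR fragments into bubbles, assign sides, mark turns."""
    boxes = sorted(boxes, key=lambda b: b["y_top"])     # 3. temporal order
    tokens, prev, cluster, last_y = [], None, [], None
    for box in boxes + [None]:                          # sentinel flushes last cluster
        if box is not None and (last_y is None or box["y_top"] - last_y <= line_gap):
            cluster.append(box)                         # 1. vertical-proximity cluster
            last_y = box["y_top"]
            continue
        if cluster:
            x = sum(b["x_median"] for b in cluster) / len(cluster)
            speaker = ("User" if x < image_width / 3            # 2. left third
                       else "Respondent" if x > 2 * image_width / 3
                       else prev or "User")             # centered: keep current
            if prev and speaker != prev:
                tokens.append("<SPEAKER_CHANGE>")       # 4. mark conversation turn
            tokens.append(" ".join(b["text"] for b in cluster))
            prev = speaker
        cluster, last_y = ([box], box["y_top"]) if box else ([], None)
    return tokens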

Chat-Specific Text Normalization

  • Emoji conversion: 😠 → [ANGRY_FACE] (preserves sentiment signal)
  • Slang dictionary: 1000+ mappings ("lol" → "laugh out loud")
  • ALL-CAPS detection: 1.25× sentiment multiplier (shouting signal)
  • Punctuation intensity: "!!!" → 1.3× weight factor
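
A sketch of these rules with tiny illustrative subsets of the emoji and slang dictionaries (the production mappings run to 1000+ entries); the returned weight multiplies the downstream sentiment score:

import re

EMOJI_MAP = {"😠": " [ANGRY_FACE] ", "😂": " [JOY_FACE] "}        # subset
SLANG_MAP = {"lol": "laugh out loud", "idk": "i do not know"}    # subset

def normalize_chat(text):
    """Returns normalized text plus a sentiment weight factor."""
    weight = 1.0
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        weight *= 1.25                       # ALL-CAPS = shouting signal
    if re.search(r"!{3,}", text):
        weight *= 1.3                        # "!!!" punctuation intensity
    for emoji, token in EMOJI_MAP.items():
        text = text.replace(emoji, token)    # keep the emoji's sentiment signal
    words = [SLANG_MAP.get(w.lower(), w) for w in text.split()]
    return " ".join(words), weight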

Enterprise Product: Sentimeter AI

The research culminated in a CTO pitch for enterprise deployment:

  • Automated QA: Process thousands of support chats daily
  • Crisis Early Warning: Detect sentiment spikes before public escalation
  • Competitive Intelligence: Systematic analysis of competitor app feedback
  • Compliance Automation: GDPR/CCPA-ready with audit trails

Tech Stack

ML/NLP: RoBERTa, BERT, spaCy NER, HuggingFace Transformers
Computer Vision: OpenCV, EasyOCR
Backend: Python, Streamlit (prototype)
Training: NVIDIA A100, AdamW optimizer, CrossEntropy with label smoothing
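
A minimal sketch of the training loop, reusing model from the Stage 3 sketch and assuming a standard PyTorch DataLoader named train_loader; the hyperparameters are illustrative defaults, not the tuned values:

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # softened targets

model.train()
for batch in train_loader:                    # assumed: yields inputs + labels
    optimizer.zero_grad()
    logits = model(**batch["inputs"]).logits
    loss = criterion(logits, batch["labels"])
    loss.backward()
    optimizer.step()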

Key Learnings

Research taught me the value of ablation studies — our final 0.923 F1-score came from dozens of experiments comparing:

  • Tesseract vs EasyOCR (EasyOCR won handily)
  • BERT vs RoBERTa (RoBERTa's dynamic masking made the difference)
  • Various spatial clustering thresholds

The most elegant solution wasn't the most complex — our speaker attribution algorithm required no ML, just clever geometry.


This research showed me that the best AI systems often combine multiple modalities — here, computer vision, NLP, and spatial reasoning working together.
