Every machine learning model is only as good as its training data. In telecom, AI systems consume data from 8 distinct sources, each with different formats, volumes, latencies, and use cases. An engineer who understands these data sources can build better features, identify data quality issues faster, and design ML pipelines that actually work in production. This article takes you inside each data source — what it contains, how to access it, how to engineer features from it, and what pitfalls to watch for.

At a glance: 8 data sources · TB/day typical volume · 50+ ML features · 3GPP standardized
Pipeline: Raw Data → ETL Pipeline → Feature Eng. → ML Model
01

MDT Logs (Minimization of Drive Tests)

3GPP TS 32.422 — UE-reported RF measurements at scale

MDT allows operators to collect RF measurements (RSRP, RSRQ, SINR) directly from subscriber UEs without deploying drive test teams. Its management and trace configuration are specified in 3GPP TS 32.422 (the radio-side procedures are in TS 37.320). MDT comes in two modes: Immediate MDT (real-time reporting for connected UEs) and Logged MDT (the UE stores measurements in idle mode and uploads them later). A single cell can generate tens to hundreds of thousands of MDT samples per day, giving coverage maps at a spatial density no drive test could match.

1. UE Configuration

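The eNB/gNB configures selected UEs. Logged MDT (configured with the RRC LoggedMeasurementConfiguration message): the UE stores RSRP/RSRQ with GPS coordinates and timestamps while in RRC_IDLE. Immediate MDT (using the regular RRC measurement configuration): the UE reports during active sessions.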

2. Collection

Logged MDT data uploaded when UE re-enters RRC_CONNECTED. eNB forwards to Trace Collection Entity (TCE). Volume: 100-500 bytes per sample, 50K-500K samples/cell/day.

3. Geo-Correlation

GPS coordinates mapped to coverage grid (50m x 50m bins). Aggregate RSRP_mean, RSRP_p5 (5th percentile), RSRQ distribution per bin. This creates a "crowdsourced coverage map."

4. ML Features

Coverage holes (RSRP < -110 dBm), weak coverage areas, overshooting cells (a cell still serving UEs well beyond its intended footprint, typically where a nearer neighbor is stronger), indoor vs. outdoor classification from altitude/speed.
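
The geo-correlation and feature steps above can be sketched in a few lines of Python. This is a minimal illustration rather than an operator pipeline: the flat-earth conversion from degrees to metres, the nearest-rank percentile, the choice to flag a hole on the bin mean, and the names `bin_key` and `aggregate` are all assumptions made for the example. Only the 50 m bin size and the -110 dBm threshold come from the text.

```python
import math
from collections import defaultdict

BIN_M = 50.0                 # grid size from the text (50 m x 50 m)
HOLE_DBM = -110.0            # coverage-hole threshold from the text
M_PER_DEG_LAT = 111_320.0    # rough metres per degree of latitude (approximation)

def bin_key(lat, lon, ref_lat):
    """Map a GPS fix to a grid index using a flat-earth approximation."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    return (math.floor(lat * M_PER_DEG_LAT / BIN_M),
            math.floor(lon * m_per_deg_lon / BIN_M))

def percentile(values, pct):
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def aggregate(samples, ref_lat):
    """samples: iterable of (lat, lon, rsrp_dbm). Returns per-bin statistics."""
    bins = defaultdict(list)
    for lat, lon, rsrp in samples:
        bins[bin_key(lat, lon, ref_lat)].append(rsrp)
    result = {}
    for key, vals in bins.items():
        mean = sum(vals) / len(vals)
        result[key] = {
            "n": len(vals),
            "rsrp_mean": mean,
            "rsrp_p5": percentile(vals, 5),
            # Assumption: a bin counts as a hole when its mean RSRP is below the threshold.
            "coverage_hole": mean < HOLE_DBM,
        }
    return result

# Hypothetical samples: two fixes in a healthy bin, two in a weak one about 1 km away.
samples = [
    (12.9700, 77.5900, -95.0),
    (12.9700, 77.5900, -97.0),
    (12.9800, 77.6000, -112.0),
    (12.9800, 77.6000, -115.0),
]
grid = aggregate(samples, ref_lat=12.97)
for key, stats in grid.items():
    print(key, stats)
```

For city-scale grids a projected coordinate system (for example UTM) would be a safer choice than the flat-earth shortcut used here.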

MDT Coverage Heatmap — UE-reported RSRP measurements aggregated per 50m grid
At a glance: 100K+ samples/cell/day · GPS geo-tagged · free (no drive test cost) · 3GPP standard TS 32.422

Hands-On Task: Identify the Coverage Problem

Given these MDT statistics for a cell, identify the most likely RF issue.

Quick Quiz
What is the key advantage of MDT over traditional drive testing?
A. Higher accuracy GPS
B. Massive scale: thousands of UEs reporting simultaneously vs. one drive test vehicle
C. Real-time reporting only
D. It replaces all other data sources
Answer: B. MDT uses subscriber UEs as measurement probes, giving coverage data from thousands of locations simultaneously, far beyond what any drive test team could achieve.
02

Drive Test Data

Controlled RF measurements with professional equipment

Drive tests use professional test UEs and RF scanners, driven by tools such as TEMS, Nemo, and XCAL, mounted in vehicles to collect high-resolution RF measurements along predefined routes. Unlike MDT, drive tests capture scanner data (all visible cells, not just the serving cell), Layer 3 messages (RRC, NAS), and application-level metrics (throughput, latency, MOS). The tradeoff: drive tests are expensive ($500-2000/day), but they provide ground truth for model validation.
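
Because a scanner reports every visible cell at a measurement point, one simple interference feature is the number of cells received almost as strongly as the best one. The sketch below illustrates that idea only. The 6 dB window, the "three or more contenders" rule, and the function name `dominance_report` are assumptions chosen for the example, not values from any tool or standard.

```python
def dominance_report(scan, window_db=6.0, min_contenders=3):
    """scan: dict mapping a cell id (e.g. PCI) to RSRP in dBm at one location.

    Returns the strongest cell and the other cells within `window_db` of it.
    """
    if not scan:
        return None
    ranked = sorted(scan.items(), key=lambda kv: kv[1], reverse=True)
    best_cell, best_rsrp = ranked[0]
    contenders = [cell for cell, rsrp in ranked[1:] if best_rsrp - rsrp <= window_db]
    return {
        "best_cell": best_cell,
        "best_rsrp": best_rsrp,
        "contenders": contenders,
        # Assumption: many near-equal cells suggests poor dominance at this point.
        "poor_dominance": len(contenders) >= min_contenders,
    }

# Hypothetical scanner readings at two points on a route.
crowded = dominance_report({101: -85.0, 102: -87.0, 103: -89.0, 104: -90.0, 105: -100.0})
clean = dominance_report({101: -80.0, 102: -95.0})
print(crowded)
print(clean)
```

MDT could not feed this feature in the same way, since it reports only the serving cell and configured neighbors.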

Drive Test Route — Color-coded RSRP measurements along the test path
At a glance: L3 message capture · 10 ms sample rate · $1K+ cost/day · scanner sees all cells

Hands-On Task: Drive Test vs. MDT: When to Use Which?

Match the scenario to the right data source.

Quick Quiz
What can drive tests capture that MDT cannot?
A. RSRP measurements
B. Scanner data showing all visible cells (not just serving)
C. GPS coordinates
D. Timestamps
Answer: B. Drive test scanners see all cells at each measurement point, enabling interference analysis and neighbor audits. MDT only reports the serving cell plus configured neighbor cells.
03

RAN PM Counters

3GPP TS 32.401/425 — The backbone of network analytics

Performance Management (PM) counters are the most widely used data source in telecom AI. Every eNB/gNB reports hundreds of counters every 15 minutes (configurable to 5 min): RRC.ConnEstab.Att, ERAB.EstabInit.Succ, HO.ExeSucc, PRB.Used.DL.Avg, DL.THP.Time. These counters feed KPI calculations (HOSR, Drop Rate, CSSR, Throughput) and are the primary input for most ML models.

| Counter | Full Name | Granularity | ML Use |
|---|---|---|---|
| RRC.ConnEstab.Att | RRC Connection Attempts | Cell/15min | Accessibility prediction |
| ERAB.RelAbnormal.RLF | Abnormal Release (Radio Link Failure) | Cell/15min | Call drop prediction |
| HO.ExeSucc / HO.ExeAtt | Handover Success/Attempts | Cell-pair/15min | HOSR optimization |
| PRB.Used.DL.Avg | Average DL PRB Utilization | Cell/15min | Capacity prediction |
| DL.THP.Time | DL Throughput Time | Cell/15min | Throughput prediction |
PM Counter Dashboard — Real-time KPI trends from RAN counters
At a glance: 500+ counters/cell · 15 min default granularity · 1M+ records/day · KPIs derived from counters

Hands-On Task: Calculate a KPI from Raw Counters

Given counters, calculate the Handover Success Rate (HOSR).

HO.ExeSucc = 1800
HO.ExeAtt = 1900
HOSR = 1800 / 1900 = 94.7% | Target: 98% | Status: Below Target
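
The same calculation, with 1800 successful handovers out of 1900 attempts, can be written as a small ratio helper. This is a sketch: the function name `success_rate` and the choice to return `None` for a reporting period with no attempts are assumptions for illustration. The 98% target comes from the task.

```python
def success_rate(successes, attempts):
    """Return a percentage, or None when the period had no attempts."""
    if attempts == 0:
        return None  # avoid dividing by zero on idle cells
    return 100.0 * successes / attempts

HOSR_TARGET = 98.0  # target from the task

hosr = success_rate(1800, 1900)
status = "Below Target" if hosr < HOSR_TARGET else "On Target"
print(f"HOSR = {hosr:.1f}% | Target: {HOSR_TARGET:.0f}% | Status: {status}")
```

Returning `None` rather than 0 for idle periods keeps a quiet cell from looking like a failing one when the KPI is later used as an ML feature.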
Quick Quiz
Which 3GPP specification defines PM counter collection for RAN?
A. TS 38.331
B. TS 32.425 (PM for E-UTRAN) / TS 28.552 (5G)
C. TS 23.501
D. TS 32.422
Answer: B. TS 32.425 defines PM counters for E-UTRAN (LTE) and TS 28.552 for 5G NR. TS 32.422 is for MDT, not PM counters.
04

UE Traces (L3 Messages)

Call-level signaling detail for root cause analysis

UE traces capture every Layer 3 signaling message for individual calls or sessions: RRC Setup/Release, Measurement Reports, Handover Commands, NAS Attach/Detach. Unlike PM counters (aggregated per cell), traces are per-subscriber and per-call, enabling precise root cause analysis. A single problematic call generates 50-200 L3 messages. Vendors: Huawei CHR (Call History Record), Ericsson CTR, Nokia CellTrace.
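
Per-call traces lend themselves to simple sequence rules. The sketch below scans one call's message list for a handover command that is never completed before a re-establishment, a pattern commonly read as a failed handover. The message labels are simplified placeholders, not vendor trace field names, and `find_handover_failures` is a hypothetical helper.

```python
def find_handover_failures(messages):
    """Return indexes of HandoverCommand messages that were followed by
    RRCReestablishment without an intervening HandoverComplete."""
    failures = []
    pending = None  # index of the last unanswered HandoverCommand
    for i, msg in enumerate(messages):
        if msg == "HandoverCommand":
            pending = i
        elif msg == "HandoverComplete":
            pending = None
        elif msg == "RRCReestablishment" and pending is not None:
            failures.append(pending)
            pending = None
    return failures

# Hypothetical traces using simplified message labels.
failed_call = ["RRCSetup", "MeasReport", "HandoverCommand", "RRCReestablishment"]
good_call = ["RRCSetup", "MeasReport", "HandoverCommand", "HandoverComplete", "RRCRelease"]
print(find_handover_failures(failed_call))
print(find_handover_failures(good_call))
```

Counting such matches per cell pair turns raw signaling into a feature that can sit alongside the aggregated PM counters.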

L3 Message Sequence — RRC and NAS signaling for a single call
At a glance: per-call granularity · 50-200 messages/call · RRC/NAS protocol layers · GB/hr volume

Hands-On Task: Diagnose from L3 Trace

A call trace shows: RRCSetup > MeasReport (RSRP=-112) > HandoverCommand > no HandoverComplete > RRCReestablishment. What happened?

Quick Quiz
What is the key difference between PM counters and UE traces?
A. PM counters are more detailed
B. PM counters are aggregated per cell; traces are per-call with individual signaling messages
C. Traces are collected more frequently
D. PM counters include GPS locations
Answer: B. PM counters give aggregated cell-level statistics; traces provide per-call signaling detail for root cause analysis.
05

OSS/BSS Logs

Configuration, inventory, and operational data

The OSS (Operations Support System) contains the network's configuration database: cell parameters, neighbor relations, antenna settings, and feature activations. The BSS (Business Support System) holds subscriber data, service plans, and billing. For ML, OSS provides the configuration context that explains why a cell behaves the way it does. A cell with 8° downtilt and 43 dBm transmit power behaves differently from one with 4° downtilt and 46 dBm; OSS tells you which is which.
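
In practice, "configuration context" usually means joining OSS parameters onto each KPI row before training. The sketch below shows one minimal way to do that, assuming both sources share a `cell_id` key. The field names (`prb_util`, `downtilt_deg`, `tx_power_dbm`) and the helper `join_config` are hypothetical, and a real pipeline would also need to respect when each configuration was valid.

```python
def join_config(pm_rows, oss_config):
    """Attach OSS configuration fields to each PM row by cell id.

    Rows whose cell has no configuration snapshot are skipped rather than guessed.
    """
    features = []
    for row in pm_rows:
        cfg = oss_config.get(row["cell_id"])
        if cfg is None:
            continue
        features.append({**row, **cfg})
    return features

# Hypothetical inputs: similar utilization, different antenna settings.
pm_rows = [
    {"cell_id": "A", "prb_util": 0.82},
    {"cell_id": "B", "prb_util": 0.80},
    {"cell_id": "C", "prb_util": 0.50},  # no OSS snapshot available
]
oss_config = {
    "A": {"downtilt_deg": 8, "tx_power_dbm": 43},
    "B": {"downtilt_deg": 4, "tx_power_dbm": 46},
}
features = join_config(pm_rows, oss_config)
for f in features:
    print(f)
```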

OSS Data Pipeline — Configuration, inventory, and topology flowing to ML models
At a glance: config (parameters) · inventory (HW/SW) · topology (neighbor relations) · change (audit logs)

Hands-On Task: Feature Engineering from OSS

Which OSS data creates the most useful ML features for coverage prediction?

Quick Quiz
Why is OSS data critical for ML models even though it changes infrequently?
A. It provides the configuration context that explains cell behavior differences
B. It has the highest data volume
C. It updates every 15 minutes
D. It replaces PM counters
Answer: A. Without OSS context, two cells with identical KPIs might have completely different root causes. Antenna tilt, power, and neighbor relations explain why a cell behaves the way it does.
06

Call Detail Records (CDRs)

3GPP TS 32.298 — Billing records repurposed for analytics

CDRs were originally designed for billing but are now a goldmine for AI. Every voice call, data session, and SMS generates a CDR containing: start/end time, duration, serving cell, data volume, release cause, QoS class, and IMSI/MSISDN (anonymized for ML). CDRs enable subscriber-level analytics: churn prediction, usage pattern clustering, and service quality scoring.
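
Subscriber-level analytics typically start by rolling session records up to one row per subscriber. The sketch below does that for a few of the CDR fields listed above. The record layout, the field names, and the `"normal"` release-cause label are assumptions for illustration, and `sub_id` stands for an already anonymized identifier, never a raw IMSI or MSISDN.

```python
from collections import defaultdict

def subscriber_features(cdrs):
    """Aggregate per-session CDRs into per-subscriber usage and quality features."""
    agg = defaultdict(lambda: {"sessions": 0, "total_bytes": 0,
                               "total_seconds": 0, "abnormal": 0})
    for cdr in cdrs:
        a = agg[cdr["sub_id"]]
        a["sessions"] += 1
        a["total_bytes"] += cdr["bytes"]
        a["total_seconds"] += cdr["duration_s"]
        if cdr["release_cause"] != "normal":
            a["abnormal"] += 1
    # Share of sessions that ended abnormally, as a simple service-quality signal.
    return {sub: {**a, "abnormal_ratio": a["abnormal"] / a["sessions"]}
            for sub, a in agg.items()}

# Hypothetical, already anonymized records.
cdrs = [
    {"sub_id": "u1", "bytes": 100, "duration_s": 30, "release_cause": "normal"},
    {"sub_id": "u1", "bytes": 300, "duration_s": 60, "release_cause": "radio_link_failure"},
    {"sub_id": "u2", "bytes": 50, "duration_s": 10, "release_cause": "normal"},
]
feats = subscriber_features(cdrs)
print(feats)
```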

CDR Waterfall — Session records flowing through the billing and analytics pipeline
At a glance: per-session granularity · 100M+ records/day · original purpose: billing · ML use case: churn

Hands-On Task: Extract Churn Features from CDRs

Which CDR-derived feature is most predictive of subscriber churn?

Quick Quiz
CDRs provide which unique perspective that PM counters cannot?
A. Cell-level aggregated KPIs
B. Subscriber-level experience and usage patterns
C. RF measurements like RSRP
D. Real-time signaling messages
Answer: B. CDRs link network performance to individual subscribers, enabling churn prediction, subscriber segmentation, and personalized QoS analysis.
07

Alarm & Event Logs

3GPP TS 32.111 — Network fault indicators

Alarms are the network's cry for help. Defined in 3GPP TS 32.111, they come in four severity levels: Critical, Major, Minor, Warning. A typical network generates 10K-100K alarms per day. The challenge for ML: alarm storms. A single fiber cut can generate 500+ correlated alarms across multiple cells. AI alarm correlation reduces noise by 80-90%, surfacing only root-cause alarms.
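
A very simple baseline for taming alarm storms is to cluster alarms that arrive close together in time and surface one candidate per cluster. The sketch below does only that. Treating the earliest alarm in a burst as the candidate root cause is a crude heuristic assumed here for illustration, as are the 60-second gap and the alarm field names; production correlation would also use topology and alarm type.

```python
def cluster_alarms(alarms, gap_s=60):
    """Group alarms into bursts: a new cluster starts when the gap
    to the previous alarm exceeds `gap_s` seconds."""
    ordered = sorted(alarms, key=lambda a: a["ts"])
    clusters = []
    for alarm in ordered:
        if clusters and alarm["ts"] - clusters[-1][-1]["ts"] <= gap_s:
            clusters[-1].append(alarm)
        else:
            clusters.append([alarm])
    return clusters

def summarize(clusters):
    """Keep the earliest alarm per cluster as the candidate root cause."""
    return [{"candidate_root": c[0]["name"], "suppressed": len(c) - 1} for c in clusters]

# Hypothetical alarms: a transport fault followed by cell outages, then an unrelated event.
alarms = [
    {"ts": 5, "name": "CELL_UNAVAILABLE", "source": "cell-2"},
    {"ts": 0, "name": "LINK_DOWN", "source": "fiber-7"},
    {"ts": 2, "name": "CELL_UNAVAILABLE", "source": "cell-1"},
    {"ts": 7, "name": "CELL_UNAVAILABLE", "source": "cell-3"},
    {"ts": 1000, "name": "HIGH_TEMPERATURE", "source": "site-9"},
]
summary = summarize(cluster_alarms(alarms))
print(summary)
```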

Alarm Timeline — Severity-coded alarms with storm detection
At a glance: Critical is the highest severity · 10K+ alarms/day · up to 90% AI noise reduction · used for root cause analysis (RCA)

Hands-On Task: Identify the Root Cause Alarm

An alarm storm: 5 cells lost, VSWR alarm on Cell A, link failure on all 5. What is the root cause?

Quick Quiz
What is the main challenge of using alarm data for ML?
A. Alarm storms create massive noise; one root cause triggers hundreds of correlated alarms
B. Alarms are too infrequent
C. Alarms lack timestamps
D. Alarms are not standardized
Answer: A. A single fiber cut can trigger 500+ alarms. ML alarm correlation identifies the root cause and suppresses the noise, reducing operator workload by 80-90%.
08

Network Probes / DPI

Deep Packet Inspection for application-level analytics

DPI probes sit at strategic points in the network (Gn/S1-U, SGi/N6) and classify traffic by application: YouTube, Netflix, WhatsApp calls, gaming, OS updates. This gives operators visibility into what is consuming bandwidth, not just how much. For ML, DPI enables traffic-type-aware capacity planning, QoE prediction (video buffering ratio, voice MOS), and application-specific optimization.
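
Once flows carry an application label, the traffic mix reduces to two shares per application: its share of bytes and its share of subscribers. The sketch below computes both from a list of classified flow records. The record layout and the helper name `app_mix` are assumptions for the example, and the classification itself is taken as given.

```python
def app_mix(flows):
    """Return, per application, its share of total bytes and of distinct subscribers."""
    total_bytes = sum(f["bytes"] for f in flows)
    all_subs = {f["sub_id"] for f in flows}
    if total_bytes == 0 or not all_subs:
        return {}
    per_app = {}
    for f in flows:
        entry = per_app.setdefault(f["app"], {"bytes": 0, "subs": set()})
        entry["bytes"] += f["bytes"]
        entry["subs"].add(f["sub_id"])
    return {
        app: {
            "byte_share": e["bytes"] / total_bytes,
            "sub_share": len(e["subs"]) / len(all_subs),
        }
        for app, e in per_app.items()
    }

# Hypothetical flows: one heavy video user among five subscribers.
flows = [
    {"sub_id": "s1", "app": "video", "bytes": 650},
    {"sub_id": "s2", "app": "messaging", "bytes": 100},
    {"sub_id": "s3", "app": "messaging", "bytes": 100},
    {"sub_id": "s4", "app": "messaging", "bytes": 100},
    {"sub_id": "s5", "app": "messaging", "bytes": 50},
]
mix = app_mix(flows)
print(mix)
```

A large gap between byte share and subscriber share, as in this toy data, signals that a small group of users drives most of the load. That matters for traffic-type-aware capacity planning.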

DPI Application Mix — Real-time traffic classification by application type
At a glance: app-level classification · QoE metrics · S1-U/N3 probe point · 100+ apps classified

Hands-On Task: Traffic Mix Analysis

If video is 65% of traffic but only 20% of subscribers, what does this tell the AI model?

Quick Quiz
What unique insight does DPI provide that no other data source can?
A. RF signal strength
B. Cell-level KPIs
C. Application-level traffic classification and QoE metrics
D. Subscriber billing information
Answer: C. Of the sources covered here, DPI is the only one that tells you what applications are using the bandwidth and how subscribers experience them (buffering, MOS, latency).

Final Assessment

10 questions on telecom data sources for AI

1. Which data source is defined in 3GPP TS 32.422?
A. MDT (Minimization of Drive Tests)
B. PM Counters
C. CDRs
D. Alarms
Answer: A. TS 32.422 defines MDT (Minimization of Drive Tests).
2. PM counters are typically collected every:
A. 1 second
B. 15 minutes
C. 1 hour
D. 24 hours
Answer: B. 15 minutes is the default PM reporting interval.
3. Which data source provides per-subscriber session details?
A. PM Counters
B. MDT Logs
C. CDRs
D. Alarms
Answer: C. CDRs contain per-subscriber, per-session details.
4. What is the main challenge with alarm data for ML?
A. Alarm storms (one root cause triggers hundreds of alarms)
B. Alarms are too rare
C. Alarms lack severity levels
D. Alarms are not time-stamped
Answer: A. Alarm storms, where one event triggers hundreds of correlated alarms, are the main challenge.
5. DPI probes classify traffic by:
A. RF signal strength
B. Application type (YouTube, WhatsApp, etc.)
C. Cell ID
D. Subscriber plan
Answer: B. DPI classifies traffic by application type.
6. Drive tests capture scanner data. What does this mean?
A. All visible cells are measured, not just the serving cell
B. Only the serving cell is measured
C. It scans documents
D. It measures only downlink
Answer: A. Scanners measure all visible cells at each point, enabling interference analysis.
7. OSS data is most useful for ML because it provides:
A. Real-time traffic data
B. Configuration context (antenna tilt, power, neighbors)
C. Subscriber behavior
D. Application classification
Answer: B. OSS provides the configuration context that explains why cells behave differently.
8. UE traces differ from PM counters in that traces are:
A. Per-call with individual L3 signaling messages
B. Aggregated per cell
C. Collected every 15 minutes
D. Only available for 5G
Answer: A. UE traces are per-call/per-session with individual L3 signaling messages.
9. Which data source is most cost-effective for coverage monitoring?
A. Drive tests ($1K+/day)
B. MDT (no drive test cost, crowdsourced from subscriber UEs)
C. DPI probes
D. CDRs
Answer: B. MDT crowdsources measurements from subscriber UEs, so it adds no drive test cost.
10. For churn prediction, which combination of data sources is most powerful?
A. MDT + Drive Tests
B. PM Counters + Alarms
C. CDRs + PM Counters + DPI (subscriber behavior + network quality + app experience)
D. OSS + Traces
Answer: C. Churn prediction needs subscriber-level data (CDRs), network quality context (PM), and experience metrics (DPI) together.

Abhijeet Kumar
Telecom AI Researcher · Building the future of network intelligence at CafeTele
