Real Data Sources for AI Models in Telecom

Every machine learning model is only as good as its training data. In telecom, AI systems consume data from eight distinct sources, each with different formats, volumes, latencies, and use cases. An engineer who understands these sources can build better features, spot data quality issues faster, and design ML pipelines that hold up in production. This article walks through each data source: what it contains, how to access it, how to engineer features from it, and what pitfalls to watch for.

At a glance: 8 data sources · TB/day typical volume · 50+ ML features · 3GPP standardized

Pipeline: Raw Data → ETL Pipeline → Feature Eng. → ML Model

MDT Logs (Minimization of Drive Tests)

3GPP TS 32.422 — UE-reported RF measurements at scale

MDT allows operators to collect RF measurements (RSRP, RSRQ, SINR) directly from subscriber UEs without deploying drive test teams. Its configuration and trace management are specified in 3GPP TS 32.422, with the radio procedures described in TS 37.320. MDT comes in two modes: Immediate MDT (real-time reporting for connected UEs) and Logged MDT (the UE stores measurements in idle mode and uploads them later). A busy cell can generate tens to hundreds of thousands of MDT samples per day, providing coverage maps at a spatial density no drive test could match.

1. UE Configuration

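The eNB/gNB configures selected UEs over RRC. Logged MDT (configured with the LoggedMeasurementConfiguration message): the UE stores RSRP/RSRQ with timestamps and, when available, GPS coordinates while in RRC_IDLE. Immediate MDT: the UE reports measurements during active sessions (RRC_CONNECTED).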

2. Collection

Logged MDT data is uploaded when the UE re-enters RRC_CONNECTED, and the eNB/gNB forwards it to the Trace Collection Entity (TCE). Volume: 100-500 bytes per sample, 50K-500K samples/cell/day.

3. Geo-Correlation

GPS coordinates mapped to coverage grid (50m x 50m bins). Aggregate RSRP_mean, RSRP_p5 (5th percentile), RSRQ distribution per bin. This creates a "crowdsourced coverage map."

4. ML Features

Coverage holes (RSRP < -110 dBm), weak coverage areas, overshooting candidates (for example, bins where the serving cell's RSRP is below a neighbor's), and indoor vs. outdoor classification from altitude and speed.
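The geo-correlation and feature steps above can be sketched in a few lines. This is a minimal illustration, not an operator pipeline: it assumes the GPS fixes have already been projected to local metric coordinates (x, y in meters), and the sample values, field layout, and nearest-rank percentile are illustrative choices.

```python
import math
from collections import defaultdict

BIN_SIZE_M = 50.0           # grid resolution from the text (50m x 50m)
HOLE_THRESHOLD_DBM = -110   # coverage-hole threshold from the text


def bin_key(x_m, y_m):
    """Map a projected position (meters) to an integer grid-bin index."""
    return math.floor(x_m / BIN_SIZE_M), math.floor(y_m / BIN_SIZE_M)


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate_bins(samples):
    """samples: iterable of (x_m, y_m, rsrp_dbm). Returns per-bin statistics."""
    grouped = defaultdict(list)
    for x_m, y_m, rsrp in samples:
        grouped[bin_key(x_m, y_m)].append(rsrp)
    stats = {}
    for key, rsrps in grouped.items():
        mean = sum(rsrps) / len(rsrps)
        stats[key] = {
            "n": len(rsrps),
            "rsrp_mean": mean,
            "rsrp_p5": percentile(rsrps, 5),
            "coverage_hole": mean < HOLE_THRESHOLD_DBM,
        }
    return stats


# Hypothetical MDT samples: (x_m, y_m, RSRP in dBm)
samples = [
    (10.0, 20.0, -95.0),
    (30.0, 45.0, -99.0),
    (120.0, 60.0, -112.0),
    (140.0, 70.0, -116.0),
    (110.0, 95.0, -108.0),
]
bins = aggregate_bins(samples)
for key, row in sorted(bins.items()):
    print(key, row)
```

In practice a minimum sample count per bin is usually worth enforcing before flagging a coverage hole, since a handful of samples can be dominated by a single indoor UE.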

MDT Coverage Heatmap — UE-reported RSRP measurements aggregated per 50m grid

Key facts: 100K+ samples/cell/day · GPS geo-tagged · no drive test cost · 3GPP standard TS 32.422

Hands-On Task: Identify the Coverage Problem

Given these MDT statistics for a cell, identify the most likely RF issue.

Quick Quiz

What is the key advantage of MDT over traditional drive testing?

A. Higher accuracy GPS

B. Massive scale: thousands of UEs reporting simultaneously vs. one drive test vehicle

C. Real-time reporting only

D. It replaces all other data sources

Answer: B. MDT uses subscriber UEs as measurement probes, giving coverage data from thousands of locations simultaneously, far beyond what any drive test team could achieve.

Drive Test Data

Controlled RF measurements with professional equipment

Drive tests use professional UE scanners (TEMS, Nemo, XCAL) mounted in vehicles to collect high-resolution RF measurements along predefined routes. Unlike MDT, drive tests capture scanner data (all cells visible, not just serving), layer-3 messages (RRC, NAS), and application-level metrics (throughput, latency, MOS). The tradeoff: expensive ($500-2000/day) but ground truth for model validation.
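Because drive test data serves as ground truth, one common use is scoring a coverage model's RSRP predictions against it. The following is a minimal sketch, assuming predictions and drive test measurements have already been matched by location; the pairing and the sample values are hypothetical.

```python
def validation_errors(pairs):
    """pairs: list of (predicted_rsrp_dbm, drive_test_rsrp_dbm).

    Returns (mean absolute error, mean signed error) in dB.
    """
    errors = [predicted - measured for predicted, measured in pairs]
    mae = sum(abs(e) for e in errors) / len(errors)
    bias = sum(errors) / len(errors)
    return mae, bias


# Hypothetical matched points along a drive route
pairs = [(-98.0, -100.0), (-105.0, -103.0), (-110.0, -114.0)]
mae, bias = validation_errors(pairs)
print(f"MAE = {mae:.2f} dB, bias = {bias:+.2f} dB")
```

A positive bias here means the model predicts stronger signal than the scanner measured, which is worth knowing before trusting its coverage-hole estimates.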

Drive Test Route — Color-coded RSRP measurements along the test path

Key facts: Layer-3 message capture · 10ms sample rate · $1K+ cost/day · scanner sees all cells

Hands-On Task: Drive Test vs. MDT: When to Use Which?

Match the scenario to the right data source.

Quick Quiz

What can drive tests capture that MDT cannot?

A. RSRP measurements

B. Scanner data showing all visible cells (not just serving)

C. GPS coordinates

D. Time stamps

Answer: B. Drive test scanners see ALL cells at each measurement point, enabling interference analysis and neighbor audits. MDT only reports serving + configured neighbor cells.

RAN PM Counters

3GPP TS 32.401/425 — The backbone of network analytics

Performance Management (PM) counters are the most widely used data source in telecom AI. Every eNB/gNB reports hundreds of counters every 15 minutes (configurable to 5 min): RRC.ConnEstab.Att, ERAB.EstabInit.Succ, HO.ExeSucc, PRB.Used.DL.Avg, DL.THP.Time. These counters feed KPI calculations (HOSR, Drop Rate, CSSR, Throughput) and are the primary input for most ML models.

Counter	Full Name	Granularity	ML Use
RRC.ConnEstab.Att	RRC Connection Attempts	Cell/15min	Accessibility prediction
ERAB.RelAbnormal.RLF	Abnormal Release (Radio Link Failure)	Cell/15min	Call drop prediction
HO.ExeSucc / HO.ExeAtt	Handover Success/Attempts	Cell-pair/15min	HOSR optimization
PRB.Used.DL.Avg	Average DL PRB Utilization	Cell/15min	Capacity prediction
DL.THP.Time	DL Throughput Time	Cell/15min	Throughput prediction

PM Counter Dashboard — Real-time KPI trends from RAN counters

Key facts: 500+ counters/cell · 15 min default granularity · 1M+ records/day · KPIs derived from counters

Hands-On Task: Calculate a KPI from Raw Counters

Given counters, calculate the Handover Success Rate (HOSR).

HO.ExeSucc = 1800

HO.ExeAtt = 1900

HOSR = HO.ExeSucc / HO.ExeAtt = 1800 / 1900 = 94.7% | Target: 98% | Status: Below Target
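The same calculation as a short sketch. The function name and the zero-attempt handling are illustrative choices; the counter values and the 98% target come from the task above.

```python
TARGET_PERCENT = 98.0  # target from the task above


def hosr_percent(ho_succ, ho_att):
    """Handover Success Rate in percent; None when there were no attempts."""
    if ho_att == 0:
        return None  # avoid dividing by zero for idle cells or intervals
    return 100.0 * ho_succ / ho_att


hosr = hosr_percent(1800, 1900)
status = "Below Target" if hosr < TARGET_PERCENT else "On Target"
print(f"HOSR = {hosr:.1f}% | Target: {TARGET_PERCENT:.0f}% | Status: {status}")
```

Returning None for intervals with no handover attempts keeps empty 15-minute bins from showing up as a 0% success rate in downstream features.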

Quick Quiz

Which 3GPP specification defines PM counter collection for RAN?

A. TS 38.331

B. TS 32.425 (PM for E-UTRAN) / TS 28.552 (5G)

C. TS 23.501

D. TS 32.422

Answer: B. TS 32.425 defines PM counters for E-UTRAN (LTE), TS 28.552 for 5G NR. TS 32.422 is for MDT, not PM counters.

UE Traces (L3 Messages)

Call-level signaling detail for root cause analysis

UE traces capture every Layer 3 signaling message for individual calls or sessions: RRC Setup/Release, Measurement Reports, Handover Commands, NAS Attach/Detach. Unlike PM counters (aggregated per cell), traces are per-subscriber and per-call, enabling precise root cause analysis. A single problematic call generates 50-200 L3 messages. Vendors: Huawei CHR (Call History Record), Ericsson CTR, Nokia CellTrace.

L3 Message Sequence — RRC and NAS signaling for a single call

Key facts: per-call granularity · 50-200 messages/call · RRC/NAS protocol layers · GB/hr volume

Hands-On Task: Diagnose from L3 Trace

A call trace shows: RRCSetup > MeasReport (RSRP=-112) > HandoverCommand > no HandoverComplete > RRCReestablishment. What happened?
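A pattern like this can be detected automatically once traces are parsed into ordered message names. The sketch below is a simplified rule, assuming the message names used in the task; real vendor traces use their own message identifiers and carry far more context, such as cause values and timers.

```python
def classify_trace(messages):
    """Return a coarse label for an ordered list of L3 message names."""
    pending_handover = False
    for name in messages:
        if name == "HandoverCommand":
            pending_handover = True       # handover execution has started
        elif name == "HandoverComplete":
            pending_handover = False      # target cell confirmed the handover
        elif name == "RRCReestablishment" and pending_handover:
            # Re-establishment while a handover is still open suggests the
            # UE never reached the target cell.
            return "handover_failure"
    return "no_handover_failure_detected"


trace = ["RRCSetup", "MeasReport", "HandoverCommand", "RRCReestablishment"]
label = classify_trace(trace)
print(label)
```

The weak RSRP in the measurement report (-112 dBm) is consistent with a handover triggered too late, but confirming that would need the target cell's measurements as well.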

Quick Quiz

What is the key difference between PM counters and UE traces?

A. PM counters are more detailed

B. PM counters are aggregated per cell; traces are per-call with individual signaling messages

C. Traces are collected more frequently

D. PM counters include GPS locations

Answer: B. PM counters give aggregated cell-level statistics; traces provide per-call signaling detail for root cause analysis.

OSS/BSS Logs

Configuration, inventory, and operational data

The OSS (Operations Support System) contains the network's configuration database: cell parameters, neighbor relations, antenna settings, feature activations. The BSS (Business Support System) holds subscriber data, service plans, billing. For ML, OSS provides the configuration context that explains why a cell behaves the way it does. A cell with downtilt=8 and power=43dBm behaves differently from downtilt=4 and power=46dBm — OSS tells you which is which.
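Adding that configuration context usually means joining OSS parameters onto performance rows by cell. A minimal sketch, assuming a hypothetical OSS export keyed by cell ID; the field names, cell IDs, and utilization values are illustrative, while the tilt and power values mirror the example above.

```python
# Hypothetical OSS export: configuration per cell
oss_config = {
    "CELL_A": {"downtilt_deg": 8, "tx_power_dbm": 43},
    "CELL_B": {"downtilt_deg": 4, "tx_power_dbm": 46},
}

# Hypothetical PM-derived rows for the same interval
pm_rows = [
    {"cell_id": "CELL_A", "prb_util_dl": 0.62},
    {"cell_id": "CELL_B", "prb_util_dl": 0.61},
    {"cell_id": "CELL_C", "prb_util_dl": 0.30},  # no OSS record available
]


def add_config_features(pm_rows, oss_config):
    """Left-join OSS configuration onto PM rows, flagging missing config."""
    enriched = []
    for row in pm_rows:
        cfg = oss_config.get(row["cell_id"])
        features = dict(row)
        features["has_config"] = cfg is not None
        features["downtilt_deg"] = cfg["downtilt_deg"] if cfg else None
        features["tx_power_dbm"] = cfg["tx_power_dbm"] if cfg else None
        enriched.append(features)
    return enriched


features = add_config_features(pm_rows, oss_config)
for row in features:
    print(row)
```

Keeping an explicit has_config flag, rather than dropping unmatched cells, makes inventory gaps visible as a data quality signal instead of silently shrinking the training set.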

OSS Data Pipeline — Configuration, inventory, and topology flowing to ML models

OSS data categories: configuration parameters · HW/SW inventory · topology (neighbor relations) · change audit logs

Hands-On Task: Feature Engineering from OSS

Which OSS data creates the most useful ML features for coverage prediction?

Quick Quiz

Why is OSS data critical for ML models even though it changes infrequently?

A. It provides the configuration context that explains cell behavior differences

B. It has the highest data volume

C. It updates every 15 minutes

D. It replaces PM counters

Answer: A. Without OSS context, two cells with identical KPIs might have completely different root causes. Antenna tilt, power, and neighbor relations explain WHY a cell behaves the way it does.

Call Detail Records (CDRs)

3GPP TS 32.298 — Billing records repurposed for analytics

CDRs were originally designed for billing but are now a goldmine for AI. Every voice call, data session, and SMS generates a CDR containing: start/end time, duration, serving cell, data volume, release cause, QoS class, and IMSI/MSISDN (anonymized for ML). CDRs enable subscriber-level analytics: churn prediction, usage pattern clustering, and service quality scoring.
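Subscriber-level features typically come from aggregating session records per user after replacing the identifier. The sketch below assumes a hypothetical simplified CDR layout (imsi, bytes, release_cause); real CDR fields follow the operator's charging system. A salted hash is pseudonymization rather than full anonymization, so access to the salt still needs to be controlled.

```python
import hashlib
from collections import defaultdict


def pseudonymize(imsi, salt):
    """Replace the IMSI with a salted hash so raw identifiers never reach the model."""
    return hashlib.sha256((salt + imsi).encode("utf-8")).hexdigest()[:16]


def subscriber_features(cdrs, salt):
    """Aggregate per-session CDRs into one feature row per subscriber."""
    totals = defaultdict(lambda: {"sessions": 0, "bytes": 0, "abnormal": 0})
    for cdr in cdrs:
        row = totals[pseudonymize(cdr["imsi"], salt)]
        row["sessions"] += 1
        row["bytes"] += cdr["bytes"]
        if cdr["release_cause"] != "normal":
            row["abnormal"] += 1
    features = {}
    for sub_id, row in totals.items():
        features[sub_id] = dict(row, abnormal_ratio=row["abnormal"] / row["sessions"])
    return features


# Hypothetical session records
cdrs = [
    {"imsi": "001010000000001", "bytes": 5_000_000, "release_cause": "normal"},
    {"imsi": "001010000000001", "bytes": 1_000_000, "release_cause": "radio_link_failure"},
    {"imsi": "001010000000002", "bytes": 250_000, "release_cause": "normal"},
]
features = subscriber_features(cdrs, salt="demo-salt")
for sub_id, row in features.items():
    print(sub_id, row)
```

Features such as the abnormal release ratio are candidates for a churn model, not proven predictors; their value depends on the operator's own data.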

CDR Waterfall — Session records flowing through the billing and analytics pipeline

Key facts: per-session granularity · 100M+ records/day · original purpose: billing · ML use case: churn

Hands-On Task: Extract Churn Features from CDRs

Which CDR-derived feature is most predictive of subscriber churn?

Quick Quiz

CDRs provide which unique perspective that PM counters cannot?

A. Cell-level aggregated KPIs

B. Subscriber-level experience and usage patterns

C. RF measurements like RSRP

D. Real-time signaling messages

Answer: B. CDRs link network performance to individual subscribers, enabling churn prediction, subscriber segmentation, and personalized QoS analysis.

Alarm & Event Logs

3GPP TS 32.111 — Network fault indicators

Alarms are the network's cry for help. Defined in 3GPP TS 32.111, they come in four severity levels: Critical, Major, Minor, Warning. A typical network generates 10K-100K alarms per day. The challenge for ML is alarm storms: a single fiber cut can generate 500+ correlated alarms across multiple cells. AI alarm correlation can reduce this noise by 80-90%, surfacing only root-cause alarms.
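A simple rule-based baseline illustrates what correlation means before any ML is involved. The sketch below groups alarms that share an upstream transport link and arrive within a short window, then treats the earliest alarm in each group as the root-cause candidate. The transport_link field, the 60-second window, and the earliest-alarm heuristic are all assumptions for illustration.

```python
from collections import defaultdict

WINDOW_S = 60  # assumed correlation window in seconds


def correlate(alarms):
    """Group alarms by shared transport link within WINDOW_S of the group's first alarm."""
    by_link = defaultdict(list)
    for alarm in sorted(alarms, key=lambda a: a["ts"]):
        by_link[alarm["transport_link"]].append(alarm)

    groups = []
    for items in by_link.values():
        current = [items[0]]
        for alarm in items[1:]:
            if alarm["ts"] - current[0]["ts"] <= WINDOW_S:
                current.append(alarm)
            else:
                groups.append(current)
                current = [alarm]
        groups.append(current)

    # Heuristic: the earliest alarm in a group is the root-cause candidate.
    return [{"root_candidate": g[0], "suppressed": len(g) - 1} for g in groups]


# Hypothetical alarms: ts is seconds since the start of the observation
alarms = [
    {"id": "A1", "ts": 0, "transport_link": "LINK_1", "type": "link_failure"},
    {"id": "A2", "ts": 2, "transport_link": "LINK_1", "type": "cell_unavailable"},
    {"id": "A3", "ts": 3, "transport_link": "LINK_1", "type": "cell_unavailable"},
    {"id": "A4", "ts": 5, "transport_link": "LINK_1", "type": "cell_unavailable"},
    {"id": "A5", "ts": 500, "transport_link": "LINK_2", "type": "high_temperature"},
]
result = correlate(alarms)
for group in result:
    print(group["root_candidate"]["id"], "suppresses", group["suppressed"], "alarms")
```

Learned correlation models generally replace the fixed window and topology rule with patterns mined from historical alarm sequences, but a baseline like this is useful for measuring how much they actually add.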

Alarm Timeline — Severity-coded alarms with storm detection

Key facts: Critical is the highest severity · 10K+ alarms/day · up to 90% AI noise reduction · root cause analysis (RCA) use

Hands-On Task: Identify the Root Cause Alarm

An alarm storm: 5 cells lost, VSWR alarm on Cell A, link failure on all 5. What is the root cause?

Quick Quiz

What is the main challenge of using alarm data for ML?

A. Alarm storms create massive noise; one root cause triggers hundreds of correlated alarms

B. Alarms are too infrequent

C. Alarms lack timestamps

D. Alarms are not standardized

Answer: A. A single fiber cut can trigger 500+ alarms. ML alarm correlation identifies the root cause and suppresses the rest, which can reduce alarm noise by 80-90%.

Network Probes / DPI

Deep Packet Inspection for application-level analytics

DPI probes sit at strategic points in the network (Gn/S1-U, SGi/N6) and classify traffic by application: YouTube, Netflix, WhatsApp calls, gaming, OS updates. This gives operators visibility into what is consuming bandwidth, not just how much. For ML, DPI enables traffic-type-aware capacity planning, QoE prediction (video buffering ratio, voice MOS), and application-specific optimization.
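Traffic-type-aware planning starts with two simple ratios per application class: its share of bytes and the share of subscribers using it. A minimal sketch, assuming hypothetical flow records already labeled by the probe with an app_class and a pseudonymous sub_id; the record layout and values are illustrative.

```python
from collections import defaultdict


def traffic_mix(flows):
    """Return byte share and subscriber share per application class."""
    bytes_by_app = defaultdict(int)
    subs_by_app = defaultdict(set)
    all_subs = set()
    for flow in flows:
        bytes_by_app[flow["app_class"]] += flow["bytes"]
        subs_by_app[flow["app_class"]].add(flow["sub_id"])
        all_subs.add(flow["sub_id"])

    total_bytes = sum(bytes_by_app.values())
    return {
        app: {
            "byte_share": app_bytes / total_bytes,
            "sub_share": len(subs_by_app[app]) / len(all_subs),
        }
        for app, app_bytes in bytes_by_app.items()
    }


# Hypothetical classified flows (bytes in arbitrary units)
flows = [
    {"sub_id": "s1", "app_class": "video", "bytes": 650},
    {"sub_id": "s1", "app_class": "web", "bytes": 50},
    {"sub_id": "s2", "app_class": "web", "bytes": 100},
    {"sub_id": "s3", "app_class": "web", "bytes": 100},
    {"sub_id": "s4", "app_class": "messaging", "bytes": 50},
    {"sub_id": "s5", "app_class": "web", "bytes": 50},
]
mix = traffic_mix(flows)
for app, row in mix.items():
    print(app, row)
```

When a class has a high byte share but a low subscriber share, as video does in this sample, load is concentrated in a few heavy users, which matters for how a capacity model weights that traffic.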

DPI Application Mix — Real-time traffic classification by application type

Key facts: app-level classification · QoE metrics · S1-U/N3 probe point · 100+ apps classified

Hands-On Task: Traffic Mix Analysis

If video is 65% of traffic but only 20% of subscribers, what does this tell the AI model?

Quick Quiz

What unique insight does DPI provide that no other data source can?

A. RF signal strength

B. Cell-level KPIs

C. Application-level traffic classification and QoE metrics

D. Subscriber billing information

Answer: C. DPI is the only source that tells you WHAT applications are using the bandwidth and HOW subscribers experience them (buffering, MOS, latency).

Final Assessment

10 questions on telecom data sources for AI

1. Which data source is defined in 3GPP TS 32.422?

A. MDT (Minimization of Drive Tests)

B. PM Counters

C. CDRs

D. Alarms

Answer: A. TS 32.422 defines MDT (Minimization of Drive Tests).

2. PM counters are typically collected every:

A. 1 second

B. 15 minutes

C. 1 hour

D. 24 hours

Answer: B. 15 minutes is the default PM reporting interval.

3. Which data source provides per-subscriber session details?

A. PM Counters

B. MDT Logs

C. CDRs

D. Alarms

Answer: C. CDRs contain per-subscriber, per-session details.

4. What is the main challenge with alarm data for ML?

A. Alarm storms (one root cause triggers hundreds of alarms)

B. Alarms are too rare

C. Alarms lack severity levels

D. Alarms are not time-stamped

Answer: A. Alarm storms, where one event triggers hundreds of correlated alarms, are the main challenge.

5. DPI probes classify traffic by:

A. RF signal strength

B. Application type (YouTube, WhatsApp, etc.)

C. Cell ID

D. Subscriber plan

Answer: B. DPI classifies traffic by application type.

6. Drive tests capture scanner data. What does this mean?

A. All visible cells are measured, not just the serving cell

B. Only the serving cell is measured

C. It scans documents

D. It measures only downlink

Answer: A. Scanners measure ALL visible cells at each point, enabling interference analysis.

7. OSS data is most useful for ML because it provides:

A. Real-time traffic data

B. Configuration context (antenna tilt, power, neighbors)

C. Subscriber behavior

D. Application classification

Answer: B. OSS provides the configuration context that explains WHY cells behave differently.

8. UE traces differ from PM counters in that traces are:

A. Per-call with individual L3 signaling messages

B. Aggregated per cell

C. Collected every 15 minutes

D. Only available for 5G

Answer: A. Traces are per-call/per-session with individual L3 signaling messages.

9. Which data source is most cost-effective for coverage monitoring?

A. Drive tests ($1K+/day)

B. MDT (free, crowdsourced from subscriber UEs)

C. DPI probes

D. CDRs

Answer: B. MDT uses subscriber UEs as measurement probes, so it avoids the per-day cost of drive test teams and vehicles.

10. For churn prediction, which combination of data sources is most powerful?

A. MDT + Drive Tests

B. PM Counters + Alarms

C. CDRs + PM Counters + DPI (subscriber behavior + network quality + app experience)

D. OSS + Traces

Answer: C. Churn needs subscriber-level data (CDRs), network quality context (PM), and experience metrics (DPI) together.

Abhijeet Kumar

Telecom AI Researcher · Building the future of network intelligence at CafeTele

Next in the Series

Day 7: AI vs SON (Self-Organizing Networks) →
