Introduction
High-quality training data is critical for machine learning success. This guide compares three leading data labeling solutions: Label Studio (open-source), Scale AI (managed service), and Snorkel (programmatic labeling).
Understanding Data Labeling
Data labeling types:
- Image: Classification, detection, segmentation
- Text: NER, sentiment, classification
- Audio: Transcription, speaker diarization
- Video: Tracking, activity recognition
- Document: OCR, entity extraction
# Data labeling workflow
workflow = {
"collect": "Gather raw data",
"annotate": "Add labels via humans/models",
"validate": "Quality assurance checks",
"export": "Convert to training format"
}
Label Studio: Open-Source Annotation
Label Studio is a free, open-source platform for labeling various data types.
Installation and Setup
# Install Label Studio
pip install label-studio
# Start Label Studio
label-studio start
# Or with Docker
docker run -it -p 8080:8080 -v $(pwd)/mydata:/label-studio/data \
  heartexlabs/label-studio:latest
Label Studio XML Configuration
<!-- config.xml - Label Studio project config -->
<View>
<Header>Classify the sentiment</Header>
<Text name="text" value="$text"/>
<Choices name="sentiment" toName="text">
<Choice value="Positive"/>
<Choice value="Negative"/>
<Choice value="Neutral"/>
</Choices>
</View>
Python API
from label_studio_sdk import Client

# Connect to Label Studio (default port is 8080)
client = Client(
    url="http://localhost:8080",
    api_key="your-api-key"
)
# Get project
project = client.get_project(1)
# Import data for labeling
project.import_tasks([
    {"text": "I love this product!"},
    {"text": "Terrible experience, would not recommend."},
    {"text": "It's okay, nothing special."}
])
# Export annotations
annotations = project.export_tasks()
# Process annotations
for task in annotations:
    text = task['data']['text']
    label = task['annotations'][0]['result'][0]['value']['choices'][0]
    print(f"Text: {text} -> Label: {label}")
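In practice the export needs to be flattened into (text, label) pairs before training. A minimal sketch, assuming the choices-based task shape shown above (tasks without annotations are skipped):

```python
def tasks_to_rows(tasks):
    """Flatten exported Label Studio tasks into (text, label) pairs."""
    rows = []
    for task in tasks:
        # Skip tasks that have not been annotated yet
        if not task.get('annotations'):
            continue
        text = task['data']['text']
        label = task['annotations'][0]['result'][0]['value']['choices'][0]
        rows.append((text, label))
    return rows

# Example with the export shape from above
sample = [{
    "data": {"text": "I love this product!"},
    "annotations": [{"result": [{"value": {"choices": ["Positive"]}}]}]
}, {
    "data": {"text": "No label yet"},
    "annotations": []
}]
print(tasks_to_rows(sample))  # [('I love this product!', 'Positive')]
```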
Advanced Labeling Configurations
<!-- Named Entity Recognition -->
<View>
<Text name="text" value="$text"/>
<Labels name="entities" toName="text">
<Label value="PERSON" background="#FFabad"/>
<Label value="ORG" background="#FFD6A5"/>
<Label value="DATE" background="#CAFFBF"/>
</Labels>
</View>
<!-- Image Classification -->
<View>
<Image name="image" value="$image"/>
<Choices name="category" toName="image" perRow="3">
<Choice value="cat"/>
<Choice value="dog"/>
<Choice value="bird"/>
</Choices>
<KeyPointLabels name="points" toName="image">
<Label value="eye"/>
<Label value="nose"/>
</KeyPointLabels>
</View>
<!-- Audio Transcription -->
<View>
<Audio name="audio" value="$audio"/>
<TextArea name="transcription" toName="audio" rows="4"/>
</View>
Label Studio with Machine Learning
# Integrate an ML backend for pre-labeling
from label_studio_ml.model import LabelStudioMLBase

class SentimentModel(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        from transformers import pipeline
        classifier = pipeline("sentiment-analysis")
        results = []
        for task in tasks:
            text = task['data']['text']
            pred = classifier(text)[0]
            # Map "POSITIVE"/"NEGATIVE" to the "Positive"/"Negative" choices
            results.append({
                "score": pred['score'],
                "result": [{
                    "from_name": "sentiment",
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [pred['label'].capitalize()]}
                }]
            })
        return results
# Start the ML backend
# label-studio-ml init sentiment_backend --from sentiment_model.py
# label-studio-ml start sentiment_backend
Scale AI: Managed Labeling Service
Scale AI provides managed labeling with enterprise features and an on-demand annotation workforce.
Scale AI SDK
from scaleapi import ScaleClient

# NOTE: method and parameter names below are illustrative;
# check Scale's current API docs for exact signatures
client = ScaleClient("your-api-key")
# Create text classification project
project = client.create_project(
name="Sentiment Classification",
task_type="categorization",
categorization_labels=["Positive", "Negative", "Neutral"],
instruction="Classify the sentiment of the text"
)
# Upload data for labeling
task = client.create_task_from_json({
"project": project.id,
"instruction": "Classify the sentiment",
"data": {
"text": "This product is amazing!",
"id": "sample_001"
}
})
# Annotations are produced asynchronously by Scale's workforce;
# re-fetch the task and read the result once it is completed
task = client.get_task(task.id)
if task.status == "completed":
    annotation = task.annotation
    print(annotation["category"])
Scale AI Complex Annotations
# Image bounding boxes
bbox_task = client.create_bbox_task(
project=project.id,
instruction="Draw boxes around all vehicles",
image_url="https://example.com/traffic.jpg",
with_labels=True,
labels=["car", "truck", "motorcycle", "bicycle"]
)
# Semantic segmentation
segment_task = client.create_segmentation_task(
project=segmentation_project.id,
instruction="Segment all road areas",
image_url="https://example.com/road.jpg",
mask=True
)
# Document OCR with fields
doc_task = client.create_document_task(
project=document_project.id,
instruction="Extract all fields from invoice",
document_url="https://example.com/invoice.pdf",
fields=[
{"name": "vendor", "type": "text"},
{"name": "date", "type": "date"},
{"name": "total", "type": "amount"},
{"name": "line_items", "type": "list"}
]
)
# Video annotation
video_task = client.create_video_task(
project=video_project.id,
instruction="Track all pedestrians",
video_url="https://example.com/video.mp4",
keyframe_intervals=[0, 30, 60, 90, 120]
)
Scale AI Quality Assurance
# Set up consensus review (parameter names are illustrative;
# see Scale's docs for current project settings)
project = client.create_project(
    name="NER with Consensus",
    task_type="transcription",
    consensusWorkerCount=3,  # 3 workers per task
    consensusPercentage=80   # Require 80% agreement
)
# Review and adjust annotations
task = client.get_task("task_id")
if task.status == "completed":
    review = client.create_review(
        task_id=task.id,
        action="adjust",  # approve, adjust, reject
        adjusted_annotation=adjusted_data
    )
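The consensus settings above boil down to a simple rule: accept a label only when enough workers agree. A standalone sketch of that check (not Scale's implementation):

```python
from collections import Counter

def consensus_label(worker_labels, min_agreement=0.8):
    """Return the majority label if agreement meets the threshold, else None."""
    label, count = Counter(worker_labels).most_common(1)[0]
    if count / len(worker_labels) >= min_agreement:
        return label
    return None

print(consensus_label(["Positive", "Positive", "Positive"]))  # Positive
print(consensus_label(["Positive", "Negative", "Neutral"]))   # None
```

Tasks that fall below the threshold are the ones worth routing to a human reviewer.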
Snorkel: Programmatic Labeling
Snorkel enables labeling data through code rather than manual annotation.
Snorkel Installation
pip install snorkel
Snorkel Basic Usage
from snorkel.labeling import labeling_function
import re

# Define labeling functions (return -1 to abstain)
@labeling_function()
def contains_positive_words(x):
    positive_words = ["love", "great", "excellent", "amazing", "fantastic"]
    if any(word in x.text.lower() for word in positive_words):
        return 1  # Positive
    return -1  # Abstain

@labeling_function()
def contains_negative_words(x):
    negative_words = ["hate", "terrible", "awful", "horrible", "worst"]
    if any(word in x.text.lower() for word in negative_words):
        return 0  # Negative
    return -1  # Abstain

@labeling_function()
def starts_with_i(x):
    if x.text.lower().startswith("i "):
        return 1  # Often positive in reviews
    return -1
# Apply labeling functions to a pandas DataFrame
from snorkel.labeling import PandasLFApplier

applier = PandasLFApplier(lfs=[contains_positive_words,
                               contains_negative_words,
                               starts_with_i])
labels = applier.apply(df=df)
print(labels)
# Output:
# array([[ 1, -1, 1],
# [-1, 0, -1],
# [ 1, 0, 1]])
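Before training a label model, it helps to see what the matrix encodes. A plain-Python majority vote over the matrix above (ignoring -1 abstains) approximates what the label model does, minus its learned per-LF weights:

```python
def majority_vote(row):
    """Majority label among non-abstain votes; -1 if tied or all abstain."""
    votes = [v for v in row if v != -1]
    if not votes:
        return -1
    counts = {label: votes.count(label) for label in set(votes)}
    best = max(counts, key=counts.get)
    # A tie between labels means we cannot decide, so abstain
    if list(counts.values()).count(counts[best]) > 1:
        return -1
    return best

# The LF output matrix from above: one row per example, one column per LF
L = [[1, -1, 1],
     [-1, 0, -1],
     [1, 0, 1]]
print([majority_vote(row) for row in L])  # [1, 0, 1]
```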
Snorkel Label Model
from snorkel.labeling.model import LabelModel

# Train the label model on the LF output matrix
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=labels, n_epochs=500, lr=0.001)
# Get learned weights
print(label_model.get_weights())
# Generate probabilistic labels
proba = label_model.predict_proba(labels)
# Create training labels
train_labels = label_model.predict(labels)
Advanced Snorkel Functions
# Using surface features such as emoji
from snorkel.labeling import labeling_function

@labeling_function()
def check_emoji(x):
    positive_emoji = ["😊", "😍", "❤️", "🎉"]
    if any(emoji in x.text for emoji in positive_emoji):
        return 1
    return -1
# Using heuristics
@labeling_function()
def short_text(x):
    if len(x.text) < 20:
        return 0  # Short reviews are often negative
    return -1

@labeling_function()
def exclamation_heavy(x):
    if x.text.count("!") > 2:
        return 1  # Excitement
    return -1
# Using regex patterns
@labeling_function()
def extract_rating(x):
    match = re.search(r'(\d+)\s*(?:stars?|out of)', x.text)
    if match:
        rating = int(match.group(1))
        if rating >= 4:
            return 1
        elif rating <= 2:
            return 0
    return -1
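Regex labeling functions are easy to get subtly wrong, so it pays to spot-check the pattern on sample strings. A standalone check of the rating logic above, written as a plain function without the Snorkel decorator:

```python
import re

def rating_rule(text):
    """Same logic as extract_rating, as a plain function for testing."""
    match = re.search(r'(\d+)\s*(?:stars?|out of)', text)
    if match:
        rating = int(match.group(1))
        if rating >= 4:
            return 1
        elif rating <= 2:
            return 0
    return -1  # No rating found, or a middling 3: abstain

print(rating_rule("Gave it 5 stars, would buy again"))  # 1
print(rating_rule("1 star. Avoid."))                    # 0
print(rating_rule("Decent, 3 out of 5"))                # -1
```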
Snorkel for Images
# Snorkel handles augmentation via transformation functions (TFs)
# applied through a policy; the rotation TF below assumes an
# `image` column holding PIL images
from snorkel.augmentation import transformation_function, RandomPolicy, PandasTFApplier

@transformation_function()
def rotate_image(x):
    x.image = x.image.rotate(15)
    return x

# Apply to dataset
policy = RandomPolicy(n_tfs=1, sequence_length=1)
tf_applier = PandasTFApplier([rotate_image], policy)
X_augmented = tf_applier.apply(X_train)
Comparison
| Feature | Label Studio | Scale AI | Snorkel |
|---|---|---|---|
| Type | Open-source | Managed | Programmatic |
| Cost | Free (self-hosted) | Pay per label | Free |
| Annotation Types | All | All | Text, images |
| ML Integration | Pre-labeling | AutoML | Weak supervision |
| Quality Control | Manual | Built-in | Via label model |
| Scaling | Manual + API | Massive workforce | Code-based |
When to Use Each
Label Studio
- Budget constraints
- Need full control
- Custom annotation needs
# Good: Quick annotation setup
project = client.create_project(...)
Scale AI
- Enterprise needs
- Large-scale labeling
- Fast turnaround
# Good: Scale without hiring annotators
task = client.create_task(...)
Snorkel
- Have domain expertise in code
- Need fast iteration
- Can define heuristics
# Good: Programmatic labeling
@labeling_function()
def my_rule(x):
    # Your labeling logic
    return label
Bad Practices to Avoid
Bad Practice 1: No Quality Checks
# Bad: Accept all labels
labels = export_data() # No validation
# Good: Add quality checks
for task in annotations:
    if task['quality_score'] < 0.8:
        send_for_review(task)
Bad Practice 2: Single Annotator
# Bad: One person labels everything
# Risk of bias and errors
# Good: Multiple annotators with consensus
project = client.create_project(
consensusWorkerCount=3,
consensusPercentage=80
)
Bad Practice 3: Ignoring Edge Cases
# Bad: Only labeling common cases
# Model won't handle rare inputs
# Good: Include edge cases
edge_cases = [
"Very short",
"Very long",
"Multiple languages",
"Mixed content"
]
Good Practices
Labeling Guidelines
# Good: Comprehensive guidelines
guidelines = """
## Sentiment Classification
### Positive (label: 1)
- Expresses satisfaction
- Recommends product
- Contains praise words
### Negative (label: 0)
- Expresses dissatisfaction
- Contains complaints
- Warns others
### Neutral (label: 2; note that -1 is reserved for abstain in the Snorkel examples above)
- Purely factual
- No sentiment expressed
- Questions only
"""
Active Learning
# Good: Smart sampling, labeling the most uncertain samples first
uncertain_scores = 1 - proba.max(axis=1)
uncertain_indices = uncertain_scores.argsort()[-100:]

# Label these next
for idx in uncertain_indices:
    send_for_labeling(data[idx])
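The argsort trick above can be verified on a toy probability matrix. A minimal sketch with NumPy, selecting the 2 most uncertain rows:

```python
import numpy as np

proba = np.array([
    [0.99, 0.01],  # confident
    [0.55, 0.45],  # uncertain
    [0.90, 0.10],  # fairly confident
    [0.51, 0.49],  # most uncertain
])
uncertain_scores = 1 - proba.max(axis=1)
# argsort is ascending, so the most uncertain rows sit at the end
top2 = uncertain_scores.argsort()[-2:]
print(sorted(top2.tolist()))  # [1, 3]
```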
Version Control
# Good: Track label versions
labels = export_annotations(version="2.1")
# Save version info
metadata = {
"version": "2.1",
"date": "2025-12-22",
"annotators": ["John", "Jane"],
"guidelines": "v2.0"
}
External Resources
- Label Studio Documentation
- Scale AI Documentation
- Snorkel Documentation
- Snorkel Flow
- Label Studio GitHub
- Snorkel Papers
- Data Labeling Best Practices