What is the Object of Analysis?
The world is composed of objects, each with different attributes. Understanding objects and their attributes is the foundation of all data analysis.
For example, consider students as objects. A student has attributes such as name, age, grade, and score. Composing multiple students into a table forms the main object of data analysis:
| Name | Age | Grade | Score |
|---|---|---|---|
| Alice | 18 | 12 | 95 |
| Bob | 17 | 11 | 88 |
| Carol | 18 | 12 | 92 |
A table is divided into rows and columns. Each column holds the values of all objects for one attribute, so every value in a column shares the same data type. Each row represents a single object, and values across a row may have different types.
This tabular representation, where each row is an observation and each column is a variable, is the fundamental data structure in data science. Understanding this structure is essential before moving on to more complex analyses.
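The student table above can be reproduced directly as a pandas DataFrame; a minimal sketch:

```python
import pandas as pd

# Each column is one attribute; each row is one student (observation)
students = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [18, 17, 18],
    "Grade": [12, 11, 12],
    "Score": [95, 88, 92],
})

print(students.shape)   # (3, 4): three observations, four variables
print(students.dtypes)  # each column has a single data type
```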
Types of Data
- Numerical Data: Continuous (height, weight) or discrete (count)
- Categorical Data: Nominal (gender, color) or ordinal (education level)
- Temporal Data: Dates and timestamps
- Text Data: Unstructured data requiring processing
- Spatial Data: Geographic coordinates and shapes
- Image Data: Pixel data requiring computer vision techniques
- Graph Data: Network structures with nodes and edges
- Multimedia: Audio, video, and combined media
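Several of these types map directly onto pandas column dtypes; a small sketch (column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170.2, 165.5],                    # numerical (continuous)
    "visits": [3, 7],                               # numerical (discrete)
    "color": pd.Categorical(["red", "blue"]),       # categorical (nominal)
    "level": pd.Categorical(                        # categorical (ordinal)
        ["BSc", "MSc"],
        categories=["BSc", "MSc", "PhD"],
        ordered=True,
    ),
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11"]),  # temporal
    "comment": ["great", "ok"],                     # text (object dtype)
})

print(df.dtypes)
```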
Data Types in 2026
Modern data science handles additional complex types:
| Data Type | Examples | Tools |
|---|---|---|
| Time Series | Stock prices, IoT sensors | Prophet, StatsModels |
| Text | Reviews, documents | Transformers, BERT |
| Images | Medical scans, photos | PyTorch, TensorFlow |
| Audio | Speech, music | Librosa, Whisper |
| Video | Surveillance, streaming | OpenCV, FFmpeg |
| 3D Point Clouds | Lidar, CAD | Open3D, PCL |
What Do We Analyze?
Analysis is typically conducted along three dimensions: time, geographic space, and group (comparisons between objects or sets of objects).
Time Dimension
The time dimension allows us to compare past, present, and future. Questions in this dimension include:
- How has sales performance changed over the last year?
- What is the trend in website traffic?
- What is the seasonal pattern in product demand?
- What will future demand likely be?
Time series analysis uses techniques like moving averages, exponential smoothing, and ARIMA models to identify trends and forecast future values.
```python
# Time series forecasting with Python
import pandas as pd
from prophet import Prophet

# Prepare data (Prophet expects columns named 'ds' and 'y')
df = pd.DataFrame({
    'ds': pd.date_range('2024-01-01', periods=365, freq='D'),
    'y': sales_data  # Your sales data (365 values)
})

# Train model
model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
model.fit(df)

# Forecast 30 days ahead
future = model.make_future_dataframe(periods=30)
forecast = model.predict(future)
```
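The simpler techniques mentioned above, moving averages and exponential smoothing, are built into pandas; a minimal sketch on synthetic data (standing in for real sales figures):

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a gentle upward trend plus noise
rng = np.random.default_rng(42)
s = pd.Series(
    100 + np.arange(90) * 0.5 + rng.normal(0, 5, 90),
    index=pd.date_range("2024-01-01", periods=90, freq="D"),
)

smooth_ma = s.rolling(window=7).mean()            # 7-day moving average
smooth_ewm = s.ewm(span=7, adjust=False).mean()   # exponential smoothing

print(smooth_ma.dropna().head())
```

The rolling mean weights the last seven days equally, while the exponentially weighted mean reacts faster to recent changes; both are common first steps before fitting an ARIMA or Prophet model.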
Geographical Space Dimension
This dimension refers to location and region, typically encoded as latitude and longitude. Geographic analysis enables:
- Mapping customer distribution
- Analyzing regional performance differences
- Finding optimal store locations
- Understanding geographic patterns in data
Geographic analysis uses techniques like choropleth maps, heat maps, and spatial clustering to reveal geographic patterns.
```python
# Geographic analysis
import geopandas as gpd
import folium

# Create a map centered on the continental US
m = folium.Map(location=[39.8283, -98.5795], zoom_start=4)

# Add a choropleth layer (us_states is state-boundary GeoJSON;
# census_data is a DataFrame with 'State' and 'Density' columns)
folium.Choropleth(
    geo_data=us_states,
    name='Population Density',
    data=census_data,
    columns=['State', 'Density'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
).add_to(m)
```
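Spatial clustering, mentioned above, can be sketched with scikit-learn's DBSCAN using the haversine metric; the coordinates below are made-up customer locations for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical customer coordinates as (latitude, longitude) in degrees
coords = np.array([
    [40.71, -74.00], [40.72, -74.01], [40.70, -73.99],  # near New York
    [34.05, -118.24], [34.06, -118.25],                 # near Los Angeles
    [25.76, -80.19],                                    # isolated point
])

# The haversine metric expects radians; eps is ~50 km expressed in radians
kms_per_radian = 6371.0
db = DBSCAN(eps=50 / kms_per_radian, min_samples=2, metric="haversine")
labels = db.fit_predict(np.radians(coords))

print(labels)  # points within ~50 km share a cluster label; -1 marks noise
```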
Group Dimension
A group is an abstraction: we can compare two individual objects, or two sets of objects. Any groups that can be distinguished and classified can be compared:
- Comparing different customer segments
- A/B testing between control and treatment groups
- Comparing performance across departments
- Analyzing demographic differences
Group analysis often uses statistical tests (t-tests, chi-square tests) to determine if differences between groups are statistically significant.
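A two-sample t-test, for example, checks whether the difference between two group means is larger than chance alone would explain; a sketch on synthetic A/B data:

```python
import numpy as np
from scipy import stats

# Synthetic control and treatment groups with clearly different means
rng = np.random.default_rng(0)
control = rng.normal(loc=100, scale=10, size=200)
treatment = rng.normal(loc=110, scale=10, size=200)

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(control, treatment)
significant = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant: {significant}")
```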
What Are the Methods of Analysis?
Data analysis uses several fundamental operations, often implemented in spreadsheet software or programming languages:
Basic Operations
- Sorting: Arrange data in ascending or descending order
  `df.sort_values('score', ascending=False)`
- Summing: Calculate total values
  `df['score'].sum()`
- Averaging: Calculate mean values
  `df['score'].mean()`
- Maximum/Minimum: Find extreme values
  `df['score'].max()`, `df['score'].min()`
- Counting: Count occurrences
  `df['grade'].value_counts()`
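Applied to a small student table like the one from earlier, these operations look like this (a self-contained sketch):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "grade": [12, 11, 12],
    "score": [95, 88, 92],
})

ranked = df.sort_values("score", ascending=False)  # sorting
total = df["score"].sum()                          # summing
average = df["score"].mean()                       # averaging
hi, lo = df["score"].max(), df["score"].min()      # extremes
counts = df["grade"].value_counts()                # counting

print(ranked["name"].tolist(), total, hi, lo)
print(counts.to_dict())
```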
Advanced Operations
- Grouping: Aggregate data by categories
  `df.groupby('grade')['score'].mean()`
  `df.groupby('grade').agg({'score': ['mean', 'sum', 'count']})`
- Apply: Apply custom functions to rows or columns
  `df['score'].apply(custom_function)`
  `df.apply(lambda x: x.max() - x.min(), axis=1)`
- Pivot: Create pivot tables and cross-tabulations
  `pd.pivot_table(df, values='score', index='grade', columns='year', aggfunc='mean')`
- Join: Connect multiple tables
  `pd.merge(students, courses, on='student_id', how='left')`
- Window Functions: Calculate running statistics
  `df['rolling_avg'] = df['value'].rolling(window=7).mean()`
  `df['lag_1'] = df['value'].shift(1)`
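Grouping, pivoting, and joining can be seen together in one short runnable sketch (the roster table is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "grade": [11, 11, 12, 12],
    "year": [2024, 2025, 2024, 2025],
    "score": [85, 88, 92, 95],
})

# Grouping: mean score per grade
by_grade = df.groupby("grade")["score"].mean()

# Pivoting: grades as rows, years as columns
pivot = pd.pivot_table(df, values="score", index="grade",
                       columns="year", aggfunc="mean")

# Joining: attach a teacher to each grade
roster = pd.DataFrame({"grade": [11, 12], "teacher": ["Kim", "Lee"]})
joined = df.merge(roster, on="grade", how="left")

print(by_grade.to_dict())   # {11: 86.5, 12: 93.5}
print(joined["teacher"].tolist())
```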
Statistical Methods
Beyond basic operations, data science employs statistical methods:
- Descriptive Statistics: Mean, median, mode, standard deviation, variance
- Inferential Statistics: Hypothesis testing, confidence intervals, p-values
- Regression Analysis: Linear regression, polynomial regression, ridge/lasso
- Classification: Decision trees, random forests, SVM, neural networks
- Clustering: K-means, DBSCAN, hierarchical clustering
- Dimensionality Reduction: PCA, t-SNE, UMAP
- Time Series: ARIMA, SARIMA, Prophet, LSTM
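As a small sketch of the first items, descriptive statistics and a simple linear regression on a deliberately perfect line:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0  # perfectly linear, for illustration

# Descriptive statistics
print(np.mean(y), np.median(y), np.std(y))

# Simple linear regression: recovers slope 2 and intercept 1
result = stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue)
```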
Machine Learning Methods
Modern data science heavily uses machine learning:
```python
# Scikit-learn workflow
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare data (df is assumed to hold feature columns plus a 'target' column)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (fit on training data only, to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
predictions = model.predict(X_test_scaled)
print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")
print(classification_report(y_test, predictions))
```
Data Visualization
Rows of raw numbers are hard to interpret at a glance, so we convert data into graphics for better understanding.
Common Visualizations
- Bar Charts: Compare categorical data
- Line Charts: Show trends over time
- Scatter Plots: Reveal relationships between variables
- Histograms: Display distribution of numerical data
- Box Plots: Show statistical summaries and outliers
- Heat Maps: Display density or intensity
- Violin Plots: Combine box plot with density estimation
- Pair Plots: Show relationships across multiple variables
- Sunburst Charts: Hierarchical data visualization
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_theme(style="whitegrid")

# Create a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Bar chart (barplot shows the mean score per grade by default)
sns.barplot(x='grade', y='score', data=df, ax=axes[0, 0])
axes[0, 0].set_title('Average Score by Grade')

# Scatter plot
sns.scatterplot(x='age', y='score', hue='grade', data=df, ax=axes[0, 1])
axes[0, 1].set_title('Score vs Age')

# Box plot
sns.boxplot(x='grade', y='score', data=df, ax=axes[1, 0])
axes[1, 0].set_title('Score Distribution by Grade')

# Heat map (restrict to numeric columns before correlating)
correlation = df.select_dtypes('number').corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', ax=axes[1, 1])
axes[1, 1].set_title('Feature Correlation')

plt.tight_layout()
plt.show()
```
Interactive Visualizations
Modern dashboards use interactive tools:
```python
# Plotly for interactive charts
import plotly.express as px

fig = px.scatter(df, x='age', y='score',
                 color='grade',
                 size='score',
                 hover_data=['name'],
                 title='Student Performance')
fig.show()
```

```python
# Create a dashboard with Streamlit (run with: streamlit run app.py)
import streamlit as st

st.title('Sales Dashboard')
st.line_chart(sales_data)
st.bar_chart(category_sales)
```
Additional Insights
Data science builds on these fundamentals by incorporating statistical modeling, machine learning algorithms, and domain expertise. Key tools include:
Programming Languages
- Python: With libraries like Pandas, NumPy, SciPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow, PyTorch
- R: With packages like ggplot2, dplyr, tidyr, caret, tidymodels
- SQL: For database queries and data manipulation
- Julia: High-performance numerical computing
- Scala: Spark for big data processing
Key Python Libraries
```python
import pandas as pd                # Data manipulation
import numpy as np                 # Numerical computing
import matplotlib.pyplot as plt    # Visualization
import seaborn as sns              # Statistical visualization
import sklearn                     # Machine learning
import scipy.stats as stats        # Statistics
import statsmodels.api as sm       # Statistical models
import plotly.express as px        # Interactive visualizations
```
The Data Science Process
1. Define Question: What problem are you solving?
2. Collect Data: Gather relevant data
3. Clean Data: Handle missing values, outliers
4. Explore Data: Initial analysis and visualization
5. Feature Engineering: Create new features from existing data
6. Model Data: Apply statistical/ML models
7. Evaluate: Assess model performance
8. Interpret Results: Draw conclusions
9. Communicate Findings: Present results clearly
10. Deploy: Put model into production
Ensuring Data Quality
- Always ensure data quality through cleaning and preprocessing
- Handle missing values appropriately (impute, drop, or flag)
- Check for outliers and errors
- Validate data types and ranges
- Document data sources and transformations
- Avoid biased results through careful sampling
- Test for data drift in production
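A few of these checks sketched in pandas (the columns and ranges are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [18, None, 17, 120],      # one missing value, one impossible value
    "score": [95, 88, None, 92],
})

# Count missing values per column
missing = df.isna().sum()

# Impute missing scores with the median
df["score"] = df["score"].fillna(df["score"].median())

# Validate ranges: keep only plausible ages (NaN comparisons are False)
valid_age = df["age"].between(0, 100)
df = df[valid_age]

print(missing.to_dict(), len(df))
```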
Modern Data Science in 2026
Key Trends
| Trend | Description | Tools |
|---|---|---|
| AutoML | Automated machine learning | AutoGluon, H2O, Auto-sklearn |
| MLOps | Production ML workflows | MLflow, Kubeflow, Vertex AI |
| LLMs | Large language models | GPT-4, Claude, Llama |
| RAG | Retrieval-augmented generation | LangChain, LlamaIndex |
| Feature Stores | Centralized feature management | Feast, Tecton |
| Data Mesh | Decentralized data architecture | Domain-oriented data |
Cloud Platforms
```python
# AWS SageMaker
import sagemaker
from sagemaker import get_execution_role

# Azure ML
from azureml.core import Workspace, Dataset

# Google Cloud Vertex AI
from google.cloud import aiplatform
```
MLOps Pipeline
```yaml
# Example MLOps pipeline
stages:
  - data_preparation:
      - validate_data
      - clean_data
      - feature_engineering
  - model_training:
      - train_model
      - evaluate_model
      - register_model
  - model_deployment:
      - deploy_to_endpoint
      - monitor_model
      - detect_drift
```
Conclusion
Data science is built on the foundation of understanding objects and their attributes, organizing them into tabular form, and analyzing them across multiple dimensions (time, space, and group). The basic operations of sorting, summing, averaging, grouping, pivoting, and joining provide the toolkit for exploration, while statistical modeling and machine learning enable prediction and classification.
Master these fundamentals before moving to advanced techniques. The principles of data visualization ensure findings are communicated clearly. With proper tools and methodologies, data science transforms raw data into actionable insights.
Key takeaways:
- Understand your data types before analysis
- Use appropriate visualizations for different data types
- Follow the data science process systematically
- Ensure data quality at every stage
- Keep up with evolving tools and trends in 2026
- Consider MLOps for production-ready models
Related Topics
- Data Sources for Machine Learning
- Machine Learning Algorithms
- Statistics Fundamentals
- Data Engineering
- MLOps Best Practices