FedScope Employment Data Documentation

Getting Started

This dataset contains 140+ million records of federal civilian employees from 1998 to 2024. Each record represents one employee in a specific quarter with their job details, demographics, and compensation.

📊 Get the Data: You can either:

Download individual files directly from GitHub without cloning
Clone the full repository (3.7GB) - see the README for instructions

🚀 Quick Start: Run examples.py for comprehensive usage examples!

What's Included

72 quarterly Parquet files from September 1998 to September 2024
~1.7-2.3 million employees per quarterly snapshot
Most federal agencies (excludes uniformed military, postal service, intelligence)
52 fields per employee including demographics, job details, and pay
~2.3GB Parquet files (3.7GB total repository)

Useful For

Analyzing federal workforce trends over time
Studying compensation patterns by agency or occupation
Mapping government employment across states and regions
Understanding federal workforce demographics and changes

Privacy: All data is anonymized - no individual employees can be identified.

Understanding the Data Fields

Field Format Structure: Each employee record has 52 fields of three types:

Code Fields: Short codes used for categorization (e.g., agelvl, occ)
Description Fields: Human-readable labels (e.g., agelvlt, occt)
Data Fields: Actual values for analysis (e.g., salary, employment)

Tip: Use the description fields (ending in 't') for most analysis - they're much easier to understand!
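The naming convention can also be checked mechanically. The sketch below is only an illustration: `split_fields` is a hypothetical helper and `sample_columns` is a hand-written subset of column names, not the full 52-field schema. It treats a column as a description field when dropping its trailing 't' gives the name of another column.

```python
def split_fields(columns):
    """Split column names into (code, description) pairs and standalone fields.

    A column counts as a description field when stripping its trailing 't'
    yields another column name (e.g. 'occt' -> 'occ').
    """
    names = set(columns)
    pairs = [(c[:-1], c) for c in columns if c.endswith("t") and c[:-1] in names]
    paired = {name for pair in pairs for name in pair}
    standalone = [c for c in columns if c not in paired]
    return pairs, standalone


# Hypothetical subset of the columns, for illustration only.
sample_columns = ["year", "quarter", "agelvl", "agelvlt", "occ", "occt", "salary", "employment"]
pairs, standalone = split_fields(sample_columns)
print(pairs)       # code/description pairs
print(standalone)  # fields with no paired description
```

With a loaded DataFrame, `split_fields(list(df.columns))` would give the same breakdown for the real schema. Note that `employment` ends in 't' but is not a description field, which is why the helper requires a matching code column.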

Field Categories

When & Where: Year, quarter, agency, location
Who: Age group, education level, length of service
What Job: Occupation, grade level, supervisory status
How Much: Salary, salary range
Work Details: Full/part-time, appointment type, work status

Selected Key Fields

Start with these key fields for most analyses:

Field	What It Shows	Example Values
`year`, `quarter`	When the data was collected	2024, "September"
`agysubt`	Which agency/department	"Department of Defense", "Internal Revenue Service"
`loct`	Where they work	"CALIFORNIA", "DISTRICT OF COLUMBIA"
`occt`	Their job title	"Accountant", "Computer Scientist", "Human Resources Specialist"
`patcot`	Job category	"Professional", "Administrative", "Technical"
`salary`	Annual salary (dollars)	65000, 95000, ***** or null (when redacted)
`sallvlt`	Salary range	"$60,000 - $69,999", "$90,000 - $99,999"
`agelvlt`	Age group	"25-29 YEARS", "45-49 YEARS"
`edlvlt`	Education level	"Bachelor's Degree", "Master's Degree"
`employment`	Person count (always 1)	1

Important Data Handling:

Each row = one employee. Sum up employment to count total people.
Numeric fields such as employment and salary are stored as strings; convert them before doing arithmetic.
Redacted salaries appear as ***** (or null); filter them out, or coerce them to missing values, before averaging.
See examples.py for proper conversion techniques

Code Examples

🚀 Complete Examples: Run examples.py for comprehensive usage examples with both local and download methods!

Output is saved to examples_output.txt
Includes DuckDB examples for querying multiple years efficiently

Basic Usage Patterns

Load Data

import pandas as pd

# Load from GitHub (without cloning)
df = pd.read_parquet('https://github.com/abigailhaddad/fedscope_employment/raw/main/fedscope_data/parquet/fedscope_employment_September_2024.parquet')

# Load locally (if you've cloned the repo)
df = pd.read_parquet('fedscope_data/parquet/fedscope_employment_September_2024.parquet')

Count Employees by Agency

# employment is stored as a string ('1'), so convert it before summing
agency_counts = (
    df['employment'].astype(int)
    .groupby(df['agysubt'])
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

Analyze Salaries by Education

# Convert salary to numeric; redacted ('*****'), empty, and missing values become NaN
df['salary_numeric'] = pd.to_numeric(df['salary'], errors='coerce')
df_with_salary = df[df['salary_numeric'].notna()]
salary_by_edu = df_with_salary.groupby('edlvlt')['salary_numeric'].mean().sort_values(ascending=False)

Track Workforce Over Time

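# Count employees per snapshot. A single quarterly file holds one (year, quarter)
# pair, so df must contain several concatenated files for this to show a trend.
quarterly = (
    df['employment'].astype(int)
    .groupby([df['year'], df['quarter']])
    .sum()
)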
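Since each quarterly file covers a single period, tracking the workforce over time means stacking several snapshots first. The following is a minimal sketch, assuming pandas is installed; `headcount_by_period` is a hypothetical helper, and the two small frames are synthetic stand-ins for what `pd.read_parquet(...)` would return for two real files.

```python
import pandas as pd


def headcount_by_period(frames):
    """Stack quarterly snapshots and count employees per (year, quarter)."""
    combined = pd.concat(frames, ignore_index=True)
    # employment is stored as a string ('1'), so convert before summing
    combined["employment"] = combined["employment"].astype(int)
    return combined.groupby(["year", "quarter"])["employment"].sum()


# Tiny synthetic stand-ins for frames returned by pd.read_parquet(...)
sep_2023 = pd.DataFrame({"year": [2023] * 3, "quarter": ["September"] * 3, "employment": ["1"] * 3})
sep_2024 = pd.DataFrame({"year": [2024] * 2, "quarter": ["September"] * 2, "employment": ["1"] * 2})

trend = headcount_by_period([sep_2023, sep_2024])
print(trend)
```

`pd.concat` fills columns that are missing from some frames with NaN, so snapshots from years with fewer fields can still be stacked. For many years at once, memory use grows quickly, and the DuckDB approach may be a better fit.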

💡 Key Points:

Numeric fields are stored as strings - convert with int() or .astype(int)
Filter salary data: df[df['salary'] != '*****'] removes redacted values
Each record represents one employee (employment is always '1')

Using DuckDB for Multi-Year Analysis

Query multiple years efficiently using DuckDB:

import duckdb

# Create a view over multiple Parquet files
# (paths assume the repository root is the working directory)
con = duckdb.connect('fedscope.duckdb')
con.execute("""
    CREATE OR REPLACE VIEW employment AS
    SELECT * FROM read_parquet('fedscope_data/parquet/fedscope_employment_September_2024.parquet')
    UNION ALL
    SELECT * FROM read_parquet('fedscope_data/parquet/fedscope_employment_September_2023.parquet')
""")

# Query across years
result = con.execute("""
    SELECT year, agysubt, SUM(CAST(employment AS INTEGER)) as employees
    FROM employment
    GROUP BY year, agysubt
    ORDER BY year, employees DESC
""").fetchdf()

📊 DuckDB Benefits:

Query multiple years without loading all data into memory
SQL interface for complex aggregations
Efficient columnar processing of Parquet files
See examples.py for complete DuckDB usage

Complete Field Reference

Field Format Structure: The dataset contains three types of fields:

Code Fields: Original FedScope codes used for categorization and joining (e.g., agelvl, edlvl, occ)
Description Fields: Human-readable labels derived from lookup tables (e.g., agelvlt, edlvlt, occt)
Data Fields: Actual analytical values (e.g., employment, salary, year)

Naming Pattern: Description fields follow the pattern of adding 't' to the code field name (e.g., agelvl → agelvlt).
For Analysis: Use the description fields (ending in 't') which provide human-readable values. Code fields are primarily for data processing and joins.

Core Identifiers

Field Name	Description	Type
`dataset_key`	Unique identifier for each quarterly dataset	String
`year`	Calendar year of the snapshot	Integer
`quarter`	Quarter of the snapshot (March, June, September, December)	String

Demographics

Code Field	Description Field	Description	Example Values
`agelvl`	`agelvlt`	Age level (5-year bands)	A: < 20, B: 20-24, C: 25-29, etc.
`edlvl`	`edlvlt`	Education level	00-22 (less than high school through post-doctoral)
`los`	N/A	Length of service (data field)	Years of federal service

Job Characteristics

Code Field	Description Field	Description	Details
`occ`	`occt`	Occupation code and series	4-digit OPM occupation codes
`patco`	`patcot`	PATCO category	Professional, Administrative, Technical, Clerical, Other
`pp`	`ppt`	Pay plan	GS, ES, SL, ST, etc. (pp added in 2017, ppt added in 2018)
`ppgrd`	`ppgrdt`	Pay plan and grade	Combined pay plan and grade level
`gsegrd`	`gsegrdt`	GS equivalent grade	01-15, SES, or blank
`supervis`	`supervist`	Supervisory status	2: Supervisor, 4: Manager, 6: Leader, 8: Non-supervisor

Compensation & Work Details

Code Field	Description Field	Description	Notes
`salary`	N/A	Annual salary (data field)	Adjusted basic pay (null when redacted with asterisks)
`sallvl`	`sallvlt`	Salary level	Salary ranges in $10K bands
`wrksch`	`wrkscht`	Work schedule	F: Full-time, P: Part-time, etc.
`toa`	`toat`	Type of appointment	Permanent, temporary, term, etc.
`wkstat`	`wkstatt`	Work status	Active, leave without pay, etc.

Organization & Location

Code Field	Description Field	Description	Format
`agy`	N/A	Agency code (from lookup)	4-character agency identifier
`agysub`	`agysubt`	Sub-agency code	4-character sub-agency identifier
`loc`	`loct`	Location code	State abbreviation or country code
`stemocc`	`stemocct`	STEM occupation indicator	1: STEM, 0: Non-STEM

Coding Schemes

Age Levels (agelvl)

Age is reported in 5-year bands to protect privacy:

A: Less than 20 years
B: 20-24 years
C: 25-29 years
D: 30-34 years
E: 35-39 years
F: 40-44 years
G: 45-49 years
H: 50-54 years
I: 55-59 years
J: 60-64 years
K: 65 years and over
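Because the letter codes sort in age order, they are convenient for collapsing the eleven bands into coarser groups. The sketch below is illustrative only: `AGE_BAND_START` transcribes the lower bound of each band from the list above (with 0 standing in for the open-ended "less than 20" band), and the three groups in the hypothetical `coarse_age_group` are an arbitrary choice.

```python
# Lower bound of each agelvl band, transcribed from the list above.
# Band A is open-ended ("less than 20"), so 0 is used as a placeholder.
AGE_BAND_START = {
    "A": 0, "B": 20, "C": 25, "D": 30, "E": 35, "F": 40,
    "G": 45, "H": 50, "I": 55, "J": 60, "K": 65,
}


def coarse_age_group(agelvl):
    """Collapse a 5-year agelvl code into one of three broad groups."""
    start = AGE_BAND_START[agelvl]  # raises KeyError for codes not listed above
    if start < 30:
        return "Under 30"
    if start < 50:
        return "30-49"
    return "50 and over"


for code in ["C", "D", "H"]:
    print(code, coarse_age_group(code))
```

Applied to a DataFrame, this would look like `df['agelvl'].map(coarse_age_group)`; rows with a code outside A-K would need to be handled separately.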

Education Levels (edlvl)

Education codes range from 00 to 22:

00-05: Less than high school
06-11: High school diploma/GED
12-14: Some college
15: Bachelor's degree
16-17: Some graduate work
18: Master's degree
19-21: Professional/Doctoral degree
22: Post-doctoral

PATCO Categories

P: Professional - Requires knowledge in specialized field (engineers, scientists, doctors)
A: Administrative - Management or staff functions (HR, budget, procurement)
T: Technical - Support to professional/administrative (technicians, assistants)
C: Clerical - Structured work support (data entry, filing)
O: Other - Trades, crafts, labor, protective services

Major Pay Plans

GS: General Schedule (most common)
ES: Senior Executive Service
SL/ST: Senior Level/Scientific & Technical
AD: Administratively Determined
FW: Federal Wage System

Data Notes & Limitations

What You Should Know

Missing Salaries: Some salaries are redacted (stored as ***** or null). This affects about 8% of records.
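The share of unusable salaries can be measured per file before running any salary analysis. This is a minimal sketch, assuming pandas is installed; `redacted_share` is a hypothetical helper and `sample` is synthetic data, so the printed figure says nothing about the real dataset.

```python
import pandas as pd


def redacted_share(salary):
    """Fraction of salary values that are missing or not numeric."""
    numeric = pd.to_numeric(salary, errors="coerce")  # '*****' and None become NaN
    return numeric.isna().mean()


# Synthetic sample: two usable salaries, one redacted, one missing.
sample = pd.Series(["65000", "*****", None, "95000"])
share = redacted_share(sample)
print(f"{share:.0%} of sample salaries are unusable")
```

On a loaded file, `redacted_share(df['salary'])` gives the equivalent figure. Comparing it across agencies or years can show whether dropping redacted rows is likely to bias an average.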

Agency Name Changes: Agencies have been reorganized and renamed over the 26 years covered, so the same agency may appear under different names in different time periods.
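One possible way to get consistent labels across years is to group on the `agysub` code and attach the most recent name to every row. This sketch assumes that a renamed agency keeps its code, which the documentation does not confirm and which will not hold for every reorganization. `latest_agency_names` is a hypothetical helper, and the codes and names in the demo are made up.

```python
import pandas as pd


def latest_agency_names(df):
    """Map each agysub code to the name used in its most recent snapshot year."""
    # Assumption: the code stays stable when an agency is renamed.
    newest = df.sort_values("year", kind="stable").drop_duplicates("agysub", keep="last")
    return newest.set_index("agysub")["agysubt"]


# Synthetic rows; 'XX01', 'XX02' and all names are invented for illustration.
snapshots = pd.DataFrame({
    "year": [2001, 2024, 2024],
    "agysub": ["XX01", "XX01", "XX02"],
    "agysubt": ["Old Bureau Name", "New Bureau Name", "Another Office"],
})
names = latest_agency_names(snapshots)
snapshots["agysubt_latest"] = snapshots["agysub"].map(names)
print(snapshots)
```

Grouping on `agysubt_latest` instead of `agysubt` then keeps a renamed agency in a single series. Mergers and splits change the codes themselves and would still need manual mapping.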

Field Changes Over Time: The pay plan fields were added late (pp in 2017, ppt in 2018), so files from earlier years do not include them.

Data Coverage by Year

1998-2008: September only (1 file per year)
2009: September and December (2 files)
2010-2023: March, June, September, December (4 files per year)
2024: Through September (3 files so far)
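The coverage rules above can be turned into a list of snapshots, which is handy for looping over files. The sketch below is illustrative: `available_snapshots` and `parquet_url` are hypothetical helpers, and the URL pattern is an assumption that every file is named like the September 2024 example shown earlier in this guide.

```python
BASE_URL = "https://github.com/abigailhaddad/fedscope_employment/raw/main/fedscope_data/parquet"


def available_snapshots():
    """List (year, quarter) pairs following the coverage rules above."""
    snapshots = []
    for year in range(1998, 2025):
        if year <= 2008:
            quarters = ["September"]
        elif year == 2009:
            quarters = ["September", "December"]
        elif year == 2024:
            quarters = ["March", "June", "September"]
        else:
            quarters = ["March", "June", "September", "December"]
        snapshots.extend((year, quarter) for quarter in quarters)
    return snapshots


def parquet_url(year, quarter):
    """Build a download URL, assuming the naming pattern of the 2024 example."""
    return f"{BASE_URL}/fedscope_employment_{quarter}_{year}.parquet"


snapshots = available_snapshots()
print(len(snapshots))  # 72, matching the file count stated in this guide
print(parquet_url(*snapshots[-1]))
```

The count of 72 agrees with the number of Parquet files listed under What's Included, which is a useful sanity check on the coverage rules.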

What's NOT Included

Military personnel (separate dataset)
Postal Service employees
Intelligence agencies
Contractor information

Employment Snapshots: This shows who was employed at specific points in time. If you need data on new hires or separations specifically, those are available in separate FedScope datasets.

Resources & Links

Get Started

🚀 Run examples.py
📊 Download Parquet Files
📖 Full Repository

All files are hosted directly in this GitHub repository: 72 quarterly Parquet files in fedscope_data/parquet/ and original ZIP files in fedscope_data/raw/ for easy download and replication.

Official Government Sources

🏛️ FedScope Web Interface
📁 Original ZIP Files

Need Help?

Understanding the data: Read this guide or check the README
Technical issues: Report on GitHub
Official questions: Contact OPM directly

Citation

If you use this dataset, please cite:

Haddad, A. (2024). FedScope Employment Cube Dataset [Data set]. GitHub. https://github.com/abigailhaddad/fedscope_employment
        
