Understanding the 140+ million federal employee records (1998-2024)
⚠️ This is unofficial processed data. For official information, visit fedscope.opm.gov
This dataset contains 140+ million records of federal civilian employees from 1998 to 2024. Each record represents one employee in a specific quarter with their job details, demographics, and compensation.
Each employee record has 52 fields that fall into three categories:
agelvl
, occ
)agelvlt
, occt
)salary
, employment
)Start with these key fields for most analyses:
Field | What It Shows | Example Values |
---|---|---|
year , quarter |
When the data was collected | 2024, "September" |
agysubt |
Which agency/department | "Department of Defense", "Internal Revenue Service" |
loct |
Where they work | "CALIFORNIA", "DISTRICT OF COLUMBIA" |
occt |
Their job title | "Accountant", "Computer Scientist", "Human Resources Specialist" |
patcot |
Job category | "Professional", "Administrative", "Technical" |
salary |
Annual salary (dollars) | 65000, 95000, null (when private) |
sallvlt |
Salary range | "$60,000 - $69,999", "$90,000 - $99,999" |
agelvlt |
Age group | "25-29 YEARS", "45-49 YEARS" |
edlvlt |
Education level | "Bachelor's Degree", "Master's Degree" |
employment |
Person count (always 1) | 1 |
employment
to count total people.employment
and salary
are stored as strings*****
are redacted and should be filtered out# Load from GitHub (without cloning)
df = pd.read_parquet('https://github.com/abigailhaddad/fedscope_employment/raw/main/fedscope_data/parquet/fedscope_employment_September_2024.parquet')
# Load locally (if you've cloned the repo)
df = pd.read_parquet('fedscope_data/parquet/fedscope_employment_September_2024.parquet')
# Employment is stored as strings, convert properly
agency_counts = df.groupby('agysubt')['employment'].apply(
lambda x: sum(int(i) for i in x)
).sort_values(ascending=False).head(10)
# Convert salary to numeric, handling edge cases
df['salary_numeric'] = df['salary'].apply(lambda x: int(float(x)) if x not in [None, 'nan', '*****', ''] and pd.notna(x) else None)
df_with_salary = df[df['salary_numeric'].notna()]
salary_by_edu = df_with_salary.groupby('edlvlt')['salary_numeric'].mean().sort_values(ascending=False)
# Group by time periods
quarterly = df.groupby(['year', 'quarter'])['employment'].apply(
lambda x: sum(int(i) for i in x)
)
int(i)
df['salary'] != '*****'
removes redacted valuesemployment
is always '1')Query multiple years efficiently using DuckDB:
import duckdb
# Create a view from multiple Parquet files
con = duckdb.connect('fedscope.duckdb')
con.execute("""
CREATE VIEW employment AS
SELECT * FROM read_parquet('fedscope_employment_September_2024.parquet')
UNION ALL
SELECT * FROM read_parquet('fedscope_employment_September_2023.parquet')
""")
# Query across years
result = con.execute("""
SELECT year, agysubt, SUM(CAST(employment AS INTEGER)) as employees
FROM employment
GROUP BY year, agysubt
ORDER BY year, employees DESC
""").fetchdf()
agelvl
, edlvl
, occ
)agelvlt
, edlvlt
, occt
)employment
, salary
, year
)agelvl
→ agelvlt
).Field Name | Description | Type |
---|---|---|
dataset_key |
Unique identifier for each quarterly dataset | String |
year |
Calendar year of the snapshot | Integer |
quarter |
Quarter of the snapshot (March, June, September, December) | String |
Code Field | Description Field | Description | Example Values |
---|---|---|---|
agelvl |
agelvlt |
Age level (5-year bands) | A: < 20, B: 20-24, C: 25-29, etc. |
edlvl |
edlvlt |
Education level | 00-22 (High school through Doctorate) |
los |
N/A | Length of service (data field) | Years of federal service |
Code Field | Description Field | Description | Details |
---|---|---|---|
occ |
occt |
Occupation code and series | 4-digit OPM occupation codes |
patco |
patcot |
PATCO category | Professional, Administrative, Technical, Clerical, Other |
pp |
ppt |
Pay plan | GS, ES, SL, ST, etc. (pp added in 2017, ppt added in 2018) |
ppgrd |
ppgrdt |
Pay plan and grade | Combined pay plan and grade level |
gsegrd |
gsegrdt |
GS equivalent grade | 01-15, SES, or blank |
supervis |
supervist |
Supervisory status | 2: Supervisor, 4: Manager, 6: Leader, 8: Non-supervisor |
Code Field | Description Field | Description | Notes |
---|---|---|---|
salary |
N/A | Annual salary (data field) | Adjusted basic pay (null when redacted with asterisks) |
sallvl |
sallvlt |
Salary level | Salary ranges in $10K bands |
wrksch |
wrkscht |
Work schedule | F: Full-time, P: Part-time, etc. |
toa |
toat |
Type of appointment | Permanent, temporary, term, etc. |
wkstat |
wkstatt |
Work status | Active, leave without pay, etc. |
Code Field | Description Field | Description | Format |
---|---|---|---|
agy |
N/A | Agency code (from lookup) | 4-character agency identifier |
agysub |
agysubt |
Sub-agency code | 4-character sub-agency identifier |
loc |
loct |
Location code | State abbreviation or country code |
stemocc |
stemocct |
STEM occupation indicator | 1: STEM, 0: Non-STEM |
Age is reported in 5-year bands to protect privacy:
Education codes range from 00 to 22:
pp
added in 2017, ppt
added in 2018), so earlier years don't include these fields.
All files are hosted directly in this GitHub repository: 72 quarterly Parquet files in fedscope_data/parquet/
and original ZIP files in fedscope_data/raw/
for easy download and replication.
If you use this dataset, please cite:
This is UNOFFICIAL documentation created as part of an independent data processing project
For official information, visit fedscope.opm.gov
Last updated: 2024