GitHub Scraper FastAPI

A RESTful API, built with FastAPI, for scraping GitHub profiles, repositories, and README files.

🚀 Features

  • ✅ Async Operations - High-performance async scraping with aiohttp
  • ✅ Background Jobs - Long-running scrapes with job tracking
  • ✅ Caching - In-memory caching with TTL support
  • ✅ Multiple Export Formats - Excel, CSV, JSON
  • ✅ Rate Limiting - Built-in rate limit protection
  • ✅ OpenAPI Docs - Auto-generated interactive API documentation
  • ✅ CORS Support - Configurable CORS for web clients
  • ✅ Job Management - Track, cancel, and clean up background jobs

📋 Requirements

  • Python 3.8+
  • FastAPI
  • uvicorn
  • aiohttp
  • pandas
  • openpyxl

🔧 Installation

1. Clone or Extract

cd github-scraper-api

2. Create Virtual Environment

python -m venv venv

# Activate (Linux/Mac)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Configure Environment

cp .env.example .env
# Edit .env with your settings

๐Ÿƒ Running the API

Development Mode

# Start with auto-reload
uvicorn app.main:app --reload

# Or use the main module
python -m app.main

Production Mode

uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4

The API will be available at:

  • API: http://localhost:8000
  • Interactive Docs: http://localhost:8000/docs
  • Alternative Docs: http://localhost:8000/redoc
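
All of these commands assume that app/main.py exposes the FastAPI instance as app; a minimal, illustrative sketch of such an entry point:

import uvicorn
from fastapi import FastAPI

# Illustrative skeleton only; the real app/main.py presumably also wires up
# the routers, CORS middleware, and startup hooks described in this README.
app = FastAPI(title="GitHub Scraper API")

if __name__ == "__main__":
    # Makes `python -m app.main` behave like the uvicorn command above
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)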

📚 API Endpoints

Health & Info

# Root endpoint
GET /

# Health check
GET /health

# API statistics
GET /api/v1/stats

Scraping Endpoints

Scrape User Profile

GET /api/v1/scrape/profile/{username}?token=YOUR_TOKEN

Example:

curl http://localhost:8000/api/v1/scrape/profile/octocat

Response:

{
  "success": true,
  "username": "octocat",
  "profile": {
    "login": "octocat",
    "name": "The Octocat",
    "bio": "...",
    "public_repos": 8,
    "followers": 9999
  },
  "cached": false,
  "timestamp": "2024-02-11T10:00:00"
}

Scrape Repositories

GET /api/v1/scrape/repositories/{username}?max_repos=50&include_readme=true

Example:

curl "http://localhost:8000/api/v1/scrape/repositories/torvalds?max_repos=10"

Complete Scrape

GET /api/v1/scrape/complete/{username}

Example:

curl http://localhost:8000/api/v1/scrape/complete/octocat

Response:

{
  "success": true,
  "username": "octocat",
  "profile": {...},
  "repositories": [...],
  "total_stars": 12345,
  "total_forks": 5678,
  "top_languages": {
    "Python": 10,
    "JavaScript": 5
  }
}
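
The aggregate fields (total_stars, total_forks, top_languages) are derived from the scraped repository list. A sketch of one way to compute them from the GitHub API's stargazers_count, forks_count, and language fields (the service's actual implementation may differ):

from collections import Counter
from typing import Any, Dict, List

def summarize(repositories: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Aggregate stars/forks and count primary languages across repos
    languages = Counter(
        repo["language"] for repo in repositories if repo.get("language")
    )
    return {
        "total_stars": sum(repo.get("stargazers_count", 0) for repo in repositories),
        "total_forks": sum(repo.get("forks_count", 0) for repo in repositories),
        "top_languages": dict(languages.most_common(5)),
    }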

Async Scraping (Background Job)

POST /api/v1/scrape/async/{username}

Request Body:

{
  "username": "torvalds",
  "token": "your_token",
  "max_repos": 100,
  "include_readme": true,
  "truncate_readme": true,
  "export_format": "excel"
}

Example:

curl -X POST http://localhost:8000/api/v1/scrape/async/torvalds \
  -H "Content-Type: application/json" \
  -d '{"username": "torvalds", "export_format": "excel"}'

Response:

{
  "job_id": "123e4567-e89b-12d3-a456-426614174000",
  "status": "pending",
  "message": "Scraping job started",
  "status_url": "/api/v1/jobs/123e4567-e89b-12d3-a456-426614174000"
}

Job Management

Get Job Status

GET /api/v1/jobs/{job_id}

Response:

{
  "job_id": "123e4567-e89b-12d3-a456-426614174000",
  "status": "completed",
  "username": "torvalds",
  "progress": 100,
  "result": {...},
  "export_files": ["torvalds_data.xlsx"],
  "created_at": "2024-02-11T10:00:00",
  "updated_at": "2024-02-11T10:05:00"
}

List All Jobs

GET /api/v1/jobs?status=completed&limit=50

Cancel Job

POST /api/v1/jobs/{job_id}/cancel

Delete Job

DELETE /api/v1/jobs/{job_id}

Export Endpoints

Export Job Data

GET /api/v1/export/{job_id}/{format}?download=true

Formats: excel, csv, json

Example:

# Export to Excel
curl http://localhost:8000/api/v1/export/JOB_ID/excel

# Download directly
curl -O "http://localhost:8000/api/v1/export/JOB_ID/excel?download=true"

Download File

GET /api/v1/download/{job_id}/{filename}

List Export Files

GET /api/v1/export/{job_id}/files

🔐 Authentication

Set your GitHub token in the .env file or pass it as a query parameter:

# In .env
GITHUB_TOKEN=ghp_your_token_here

# Or as query parameter
?token=ghp_your_token_here
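
From Python, the token can be read from the environment (as loaded from .env) and forwarded per request, for example:

import os
import requests

# Forward the token from the environment; omit the parameter if unset
token = os.environ.get("GITHUB_TOKEN")
response = requests.get(
    "http://localhost:8000/api/v1/scrape/profile/octocat",
    params={"token": token} if token else None,
)
print(response.json())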

💾 Caching

The API caches responses for better performance:

  • Cache TTL: 1 hour (configurable)
  • Max Entries: 1000 (configurable)
  • LRU Eviction: Automatic cleanup

Disable cache per request:

GET /api/v1/scrape/profile/octocat?use_cache=false
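
For reference, a minimal sketch of a TTL cache with LRU eviction along the lines of core/cache.py (names and details here are illustrative, not the module's actual API):

import time
from collections import OrderedDict
from typing import Any, Optional

class TTLCache:
    # Illustrative sketch; the real core/cache.py may differ.
    def __init__(self, ttl: float = 3600, max_size: int = 1000) -> None:
        self.ttl = ttl
        self.max_size = max_size
        self._store: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.time() > expires_at:       # entry expired: drop it
            del self._store[key]
            return None
        self._store.move_to_end(key)       # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        if key not in self._store and len(self._store) >= self.max_size:
            self._store.popitem(last=False)  # evict least recently used
        self._store[key] = (time.time() + self.ttl, value)
        self._store.move_to_end(key)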

📊 Response Format

All endpoints return JSON with consistent structure:

Success Response:

{
  "success": true,
  "data": {...},
  "cached": false,
  "timestamp": "2024-02-11T10:00:00"
}

Error Response:

{
  "error": "Error message",
  "detail": "Detailed error information",
  "status_code": 404,
  "timestamp": "2024-02-11T10:00:00"
}
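
These envelopes map naturally onto Pydantic models such as those in app/models/schemas.py; a sketch with assumed class and field names:

from datetime import datetime
from typing import Any, Dict, Optional

from pydantic import BaseModel

# Assumed names; the project's actual models live in app/models/schemas.py.
class SuccessResponse(BaseModel):
    success: bool = True
    data: Dict[str, Any]
    cached: bool = False
    timestamp: datetime

class ErrorResponse(BaseModel):
    error: str
    detail: Optional[str] = None
    status_code: int
    timestamp: datetime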

🧪 Testing

Using cURL

# Profile
curl http://localhost:8000/api/v1/scrape/profile/octocat

# Repositories
curl "http://localhost:8000/api/v1/scrape/repositories/octocat?max_repos=5"

# Complete scrape
curl http://localhost:8000/api/v1/scrape/complete/octocat

Using Python

import requests

# Scrape profile
response = requests.get('http://localhost:8000/api/v1/scrape/profile/octocat')
data = response.json()
print(data)

# Start async job
job_response = requests.post(
    'http://localhost:8000/api/v1/scrape/async/torvalds',
    json={
        'username': 'torvalds',
        'max_repos': 50,
        'export_format': 'excel'
    }
)
job_id = job_response.json()['job_id']

# Check job status
status = requests.get(f'http://localhost:8000/api/v1/jobs/{job_id}')
print(status.json())

Using JavaScript/Fetch

// Scrape profile
const response = await fetch('http://localhost:8000/api/v1/scrape/profile/octocat');
const data = await response.json();
console.log(data);

// Start async job
const jobResponse = await fetch('http://localhost:8000/api/v1/scrape/async/torvalds', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    username: 'torvalds',
    max_repos: 50,
    export_format: 'excel'
  })
});
const job = await jobResponse.json();
console.log(job.job_id);

๐Ÿ“ Project Structure

github-scraper-api/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI app & startup
│   ├── core/
│   │   ├── config.py        # Settings & configuration
│   │   ├── cache.py         # Cache manager
│   │   └── jobs.py          # Job manager
│   ├── models/
│   │   └── schemas.py       # Pydantic models
│   ├── routers/
│   │   ├── scraper.py       # Scraping endpoints
│   │   ├── jobs.py          # Job endpoints
│   │   └── export.py        # Export endpoints
│   └── services/
│       ├── scraper.py       # GitHub scraper service
│       └── exporter.py      # Export service
├── tests/                   # Test files
├── data/                    # Output directory
├── requirements.txt         # Dependencies
├── .env.example             # Environment template
├── .env                     # Environment variables (create this)
└── README.md                # This file

โš™๏ธ Configuration

Edit .env file to configure:

# API Settings
DEBUG=False
HOST=0.0.0.0
PORT=8000

# GitHub Token (for higher rate limits)
GITHUB_TOKEN=your_token_here

# Cache Settings
CACHE_TTL=3600              # 1 hour
CACHE_MAX_SIZE=1000

# Job Settings
JOB_TIMEOUT=600             # 10 minutes
JOB_RETENTION_DAYS=7

# Scraping
DEFAULT_MAX_REPOS=100
REQUEST_DELAY=0.5
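
These keys are typically loaded into a typed settings object in app/core/config.py. A sketch using the pydantic-settings package (an assumption; class and field names are illustrative):

from pydantic_settings import BaseSettings, SettingsConfigDict

# Illustrative mirror of the .env keys above; the real app/core/config.py
# may use different names, defaults, or a plain os.environ reader.
class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    debug: bool = False
    host: str = "0.0.0.0"
    port: int = 8000
    github_token: str = ""
    cache_ttl: int = 3600
    cache_max_size: int = 1000
    job_timeout: int = 600
    job_retention_days: int = 7
    default_max_repos: int = 100
    request_delay: float = 0.5

settings = Settings()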

🚦 Rate Limits

Without Token

  • 60 requests/hour

With Token

  • 5000 requests/hour

The API automatically handles rate limits and displays warnings.
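
One way such handling can work, sketched with GitHub's documented X-RateLimit-Remaining and X-RateLimit-Reset response headers (the service's actual backoff logic may differ):

import time
from typing import Optional

import requests

def github_get(url: str, token: Optional[str] = None) -> requests.Response:
    # Illustrative backoff: if the quota is exhausted, sleep until it resets
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(url, headers=headers)
    if response.headers.get("X-RateLimit-Remaining") == "0":
        reset_at = int(response.headers.get("X-RateLimit-Reset", time.time()))
        time.sleep(max(0.0, reset_at - time.time()) + 1)
        response = requests.get(url, headers=headers)
    return response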

🔄 Background Jobs

Long-running scrapes are handled as background jobs:

  1. Create Job - POST to /api/v1/scrape/async/{username}
  2. Check Status - GET /api/v1/jobs/{job_id}
  3. Download Results - GET /api/v1/export/{job_id}/excel

Jobs are automatically cleaned up after 7 days.
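
A small Python helper for the status-polling step (a sketch; completed and failed are the terminal statuses used elsewhere in this README):

import time
import requests

BASE_URL = "http://localhost:8000"

def wait_for_job(job_id: str, poll_seconds: float = 5, timeout: float = 600) -> dict:
    # Poll the job endpoint until the job reaches a terminal status
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = requests.get(f"{BASE_URL}/api/v1/jobs/{job_id}").json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")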

๐Ÿ“ Examples

Example 1: Quick Profile Check

curl http://localhost:8000/api/v1/scrape/profile/octocat | jq

Example 2: Async Scrape with Export

# Start job
JOB_ID=$(curl -X POST http://localhost:8000/api/v1/scrape/async/torvalds \
  -H "Content-Type: application/json" \
  -d '{"username": "torvalds", "export_format": "excel"}' \
  | jq -r '.job_id')

# Wait and check status
sleep 30
curl http://localhost:8000/api/v1/jobs/$JOB_ID | jq

# Download results
curl -O "http://localhost:8000/api/v1/export/$JOB_ID/excel?download=true"

Example 3: Batch Scraping

import requests
import time

users = ['octocat', 'torvalds', 'gvanrossum']
jobs = []

# Start all jobs
for user in users:
    response = requests.post(
        f'http://localhost:8000/api/v1/scrape/async/{user}',
        json={'username': user, 'export_format': 'json'}
    )
    jobs.append(response.json()['job_id'])

# Wait for completion
for job_id in jobs:
    while True:
        status = requests.get(f'http://localhost:8000/api/v1/jobs/{job_id}').json()
        if status['status'] in ['completed', 'failed']:
            break
        time.sleep(5)
    
    print(f"Job {job_id}: {status['status']}")

๐Ÿณ Docker Support

FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

# Build
docker build -t github-scraper-api .

# Run
docker run -p 8000:8000 -e GITHUB_TOKEN=your_token github-scraper-api

๐Ÿค Contributing

Contributions welcome! Please see the main project's CONTRIBUTING.md.

📄 License

MIT License - See LICENSE file for details.

🆘 Troubleshooting

Issue: Module not found

# Make sure you're in the right directory
cd github-scraper-api
pip install -r requirements.txt

Issue: Port already in use

# Use a different port
uvicorn app.main:app --port 8001

Issue: Rate limit exceeded

# Add GitHub token to .env
GITHUB_TOKEN=your_token_here

📚 Additional Resources

  • FastAPI Docs: https://fastapi.tiangolo.com
  • GitHub API: https://docs.github.com/en/rest
  • Interactive API Docs: http://localhost:8000/docs

Built with FastAPI ⚡ | Powered by GitHub API 🚀