---
datasets:
- Xerv-AI/netuark-posts-6000
---

# NetuArk Posts Classifier (Ensemble Architecture)
|
This model is an ensemble classifier that categorizes technology-related social media posts by their originating news source.
It is trained to distinguish the following sources:
| - ArsTechnica |
| - FT |
| - GuardianTech |
| - HackerNews |
| - Slashdot |
| - TechCrunch |
| - TheVerge |
| - |
| ## Model Details |
| - **Architecture:** Voting Classifier (Multinomial Naive Bayes + Logistic Regression) |
| - **Vectorization:** TF-IDF (N-grams 1-3) |
| - **Accuracy:** 99.81% on the NetuArk-6000 dataset. |
| - **Classes:** HackerNews, TechCrunch, TheVerge, FT, GuardianTech, Slashdot, ArsTechnica. |
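
The pretrained pipeline is what you download in the Usage section below, but for reference, an ensemble of this shape can be assembled with scikit-learn roughly as follows. This is a minimal sketch: the hyperparameters, toy texts, and estimator names are illustrative assumptions, not the exact training configuration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# TF-IDF over word 1-3-grams, as described in the model details
vectorizer = TfidfVectorizer(ngram_range=(1, 3))

# Voting ensemble of Multinomial Naive Bayes and Logistic Regression
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # assumption: soft voting, which averages predicted probabilities
)

pipeline = Pipeline([("tfidf", vectorizer), ("clf", ensemble)])

# Toy fit/predict to show the interface (real training uses the full dataset)
texts = ["Ask HN: best terminal editor?", "Startup raises $10M Series A"]
labels = ["HackerNews", "TechCrunch"]
pipeline.fit(texts, labels)
print(pipeline.predict(["Show HN: my new CLI tool"]))
```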
|
|
| ## Training Data |
| Trained on the [Xerv-AI/netuark-posts-6000](https://huggingface.co/datasets/Xerv-AI/netuark-posts-6000) dataset. |
|
|
| ## Usage |
| ```python |
import joblib
from huggingface_hub import hf_hub_download

# The pickled pipeline references a custom `advanced_clean` preprocessing
# function from the training script. A pass-through stub is enough to satisfy
# the unpickler, since the original cleaning logic is not bundled with the model.
def advanced_clean(text):
    return text

# Register the stub on __main__ so joblib can resolve it during loading
import __main__
__main__.advanced_clean = advanced_clean
| |
| # Repository and filename |
| repo_id = 'Phase-Technologies/netuark-classifier-ensemble' |
| filename = 'netuark_ensemble_classifier.joblib' |
| |
| try: |
| # Download the file from Hugging Face |
| file_path = hf_hub_download(repo_id=repo_id, filename=filename) |
| |
| # Load the model |
| model = joblib.load(file_path) |
| prediction = model.predict(["📰 Perplexity's 'Personal Computer' Lets AI Agents Access Your Local Files #slashdot"]) |
| print(f"Prediction: {prediction}") |
| except Exception as e: |
| import traceback |
| print(f"An error occurred: {e}") |
| traceback.print_exc() |
| ``` |