European Portuguese NLP Research (2021–2026)

LLMs

| Project | Year | Sizes available | What it is | What it advanced | ArXiv paper | ArXiv HTML | Hugging Face model / dataset | Hugging Face org | GitHub repo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Albertina PT family | 2023 | 100M, 900M | DeBERTa-based Portuguese foundation encoder with pt-PT and pt-BR variants. | Established the strongest open pt-PT encoder baseline and treated European Portuguese as its own target. | 2305.06721 | HTML | Model; GLUE-PTPT dataset | PORTULAN | Not verified |
| Albertina PT family expansion | 2024 | 100M, 900M, 1.5B | Expanded family of open Portuguese encoders at multiple sizes. | Turned Albertina into an ecosystem rather than a single model. | 2403.01897 | HTML | 1.5B pt-PT model | PORTULAN | Not verified |
| Gervásio PT family | 2024 | 7B, 8B, 70B | Fully open instruction-tuned decoder model for Portuguese, with pt-PT and pt-BR variants. | One of the earliest serious open Portuguese decoder-side models from Portugal. | 2402.18766 | HTML | 7B pt-PT model | PORTULAN | Not verified |
| GlórIA | 2024 | 1.3B, 2.7B | Open generative LLM for Portuguese with strong European Portuguese orientation. | Pushed pt-PT into the decoder / LLM era and introduced CALAME-PT. | 2402.12969 | HTML | Model; CALAME-PT dataset | NOVA-vision-language | rvlopes/GlorIA |
| MediAlbertina | 2024 | 900M, 1.5B | Domain-adapted European Portuguese medical language model built on Albertina. | Brought pt-PT modeling into the medical domain. | Not verified | Not verified | HF model | portugueseNLP | Not verified |
| AMALIA | 2026 | 9B | Fully open pt-PT-first LLM paired with native pt-PT evaluation. | Current flagship for European Portuguese LLM work. | 2603.26511 | HTML | Not verified | Not verified | AMALIA-LLM/AMALIA |

Datasets

| Project | Year | What it is | What it advanced | ArXiv paper | ArXiv HTML | Hugging Face model / dataset | Hugging Face org | GitHub repo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| From Brazilian Portuguese to European Portuguese | 2024 | pt-BR → pt-PT translation study with a manually curated gold test set. | Created reusable native evaluation material for an important pt-PT correction task. | 2408.07457 | HTML | Not verified | Not verified | Not verified |
| PtBrVarId | 2025 | Cross-domain dataset for distinguishing European and Brazilian Portuguese. | Improved the curation pipeline needed to separate pt-PT from pt-BR across datasets and models. | 2502.14394 | HTML | Dataset; Model | liaad | LIAAD/portuguese_vid |
| Tradutor / PTradutor | 2025 | Open European Portuguese translation model plus dedicated parallel dataset. | Made pt-BR → pt-PT translation an open, reproducible research problem. | 2502.14385 | HTML | PTradutor dataset | hugosousa | hmosousa/tradutor; hmosousa/ptradutor |
| CitiLink-Minutes | 2026 | Multilayer pt-PT dataset of municipal meeting minutes. | Opened a practical civic-language dataset for European Portuguese. | 2602.12137 | HTML | Representative HF model | inesctec | INESCTEC/citilink-dataset |
| CitiLink-Summ | 2026 | pt-PT summarization dataset for discussion subjects in municipal meeting minutes. | Made summarization on real public-administration text possible in pt-PT. | 2602.16607 | HTML | HF summarization model | inesctec | INESCTEC/citilink-summ |
| ClaimPT | 2026 | European Portuguese dataset for claim detection in news. | Gave pt-PT fact-checking a proper research base using licensed news content. | 2601.19490 | HTML | HF model | lfcc | LIAAD/ClaimPT |

Benchmarks / evals

| Project | Year | What it is | What it advanced | ArXiv paper | ArXiv HTML | Hugging Face model / dataset | Hugging Face org | GitHub repo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DSL-TL / Language Variety Identification with True Labels | 2023 | Human-annotated benchmark for language variety identification, including pt-PT vs pt-BR. | Fixed a common evaluation flaw: assuming the source of a text reveals its variety. | 2303.01490 | Not verified | Not verified | Not verified | LanguageTechnologyLab/DSL-TL |
| CALAME-PT | 2024 | Portuguese zero-shot language-model benchmark introduced with GlórIA. | Gave Portuguese decoder models a shared evaluation surface. | GlórIA paper | HTML | HF dataset | NOVA-vision-language | rvlopes/GlorIA |
| CAMÕES | 2025 | Open benchmark for European Portuguese ASR and other Portuguese varieties. | Gave European Portuguese speech recognition a serious open benchmark. | 2508.19721 | HTML | HF dataset | inesc-id | Not verified |
| ALBA | 2026 | Linguistically grounded pt-PT benchmark for generative LLMs across eight linguistic dimensions. | Moved pt-PT evaluation toward native linguistic and cultural fidelity instead of translated proxies. | 2603.26516 | HTML | Not verified | Not verified | AMALIA-LLM/alba-benchmark |
| CLARIN-PT-LDB | 2026 | Open pt-PT LLM leaderboard focused on language, culture, and civility. | Made pt-PT LLM evaluation public and reproducible. | 2603.12872 | HTML | HF Space | PORTULAN | Not verified |
| AMALIA Eval Suite | 2026 | Native pt-PT evaluation suite released inside the AMALIA technical report. | Strengthened the move from translated benchmarks to native pt-PT evaluation. | 2603.26511 | HTML | Not verified | Not verified | AMALIA-LLM/AMALIA |