README.md · chcaa/da_dacy_large

Kenneth Enevoldsen


---
tags:
- spacy
- dacy
- danish
- token-classification
- pos tagging
- morphological analysis
- lemmatization
- dependency parsing
- named entity recognition
- coreference resolution
- named entity linking
- named entity disambiguation
language:
- da
license: apache-2.0
model-index:
- name: da_dacy_large_trf-0.2.0
  results:
  - task:
      name: NER
      type: token-classification
    metrics:
    - name: NER Precision
      type: precision
      value: 0.8858195212
    - name: NER Recall
      type: recall
      value: 0.8620071685
    - name: NER F Score
      type: f_score
      value: 0.8737511353
    dataset:
      name: DaNE
      split: test
      type: dane
  - task:
      name: TAG
      type: token-classification
    metrics:
    - name: TAG (XPOS) Accuracy
      type: accuracy
      value: 0.9913668347
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: POS
      type: token-classification
    metrics:
    - name: POS (UPOS) Accuracy
      type: accuracy
      value: 0.9908174469
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: MORPH
      type: token-classification
    metrics:
    - name: Morph (UFeats) Accuracy
      type: accuracy
      value: 0.9880227568
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: LEMMA
      type: token-classification
    metrics:
    - name: Lemma Accuracy
      type: accuracy
      value: 0.9589423796
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: UNLABELED_DEPENDENCIES
      type: token-classification
    metrics:
    - name: Unlabeled Attachment Score (UAS)
      type: f_score
      value: 0.9280885781
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: LABELED_DEPENDENCIES
      type: token-classification
    metrics:
    - name: Labeled Attachment Score (LAS)
      type: f_score
      value: 0.9079997669
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: SENTS
      type: token-classification
    metrics:
    - name: Sentences F-Score
      type: f_score
      value: 1.0
    dataset:
      name: UD Danish DDT
      split: test
      type: universal_dependencies
      config: da_ddt
  - task:
      name: coreference-resolution
      type: coreference-resolution
    metrics:
    - name: LEA
      type: f_score
      value: 0.4672143289
    dataset:
      name: DaCoref
      type: alexandrainst/dacoref
      split: custom
  - task:
      name: named-entity-linking
      type: named-entity-linking
    metrics:
    - name: Named entity Linking Precision
      type: precision
      value: 0.84
    - name: Named entity Linking Recall
      type: recall
      value: 0.2153846154
    - name: Named entity Linking F Score
      type: f_score
      value: 0.3428571429
    dataset:
      name: DaNED
      type: named-entity-linking
      split: custom
library_name: spacy
datasets:
- universal_dependencies
- dane
- alexandrainst/dacoref
metrics:
- accuracy
---

<a href="https://github.com/centre-for-humanities-computing/Dacy"><img src="https://centre-for-humanities-computing.github.io/DaCy/_static/icon.png" width="175" height="175" align="right" /></a>

# DaCy large

DaCy is a Danish language processing framework with state-of-the-art pipelines as well as functionality for analysing Danish pipelines.
DaCy's largest pipeline has achieved state-of-the-art performance on part-of-speech tagging and dependency
parsing for Danish on the Danish Dependency Treebank, as well as competitive performance on named entity recognition, named entity disambiguation and coreference resolution.
To read more, check out the [DaCy repository](https://github.com/centre-for-humanities-computing/DaCy) for material on how to use DaCy and reproduce the results.
DaCy also contains guides on using the package as well as behavioural tests for biases and robustness of Danish NLP pipelines.


| Feature | Description |
| --- | --- |
| Name | `da_dacy_large_trf` |
| Version | `0.2.0` |
| spaCy | `>=3.5.2,<3.6.0` |
| Default Pipeline | `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `parser`, `ner`, `coref`, `span_resolver`, `span_cleaner`, `entity_linker` |
| Components | `transformer`, `tagger`, `morphologizer`, `trainable_lemmatizer`, `parser`, `ner`, `coref`, `span_resolver`, `span_cleaner`, `entity_linker` |
| Vectors | 0 keys, 0 unique vectors (0 dimensions) |
| Sources | [UD Danish DDT v2.11](https://github.com/UniversalDependencies/UD_Danish-DDT) (Johannsen, Anders; Martínez Alonso, Héctor; Plank, Barbara)<br />[DaNE](https://huggingface.co/datasets/dane) (Rasmus Hvingelby, Amalie B. Pauli, Maria Barrett, Christina Rosted, Lasse M. Lidegaard, Anders Søgaard)<br />[DaCoref](https://huggingface.co/datasets/alexandrainst/dacoref) (Buch-Kromann, Matthias)<br />[DaNED](https://danlp-alexandra.readthedocs.io/en/stable/docs/datasets.html#daned) (Barrett, M. J., Lam, H., Wu, M., Lacroix, O., Plank, B., & Søgaard, A.)<br />[chcaa/dfm-encoder-large-v1](https://huggingface.co/chcaa/dfm-encoder-large-v1) (The Danish Foundation Models team) |
| License | `Apache-2.0` |
| Author | [Kenneth Enevoldsen](https://chcaa.io/#/) |
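
The spaCy row above pins a version range, so it can be worth checking the installed version before loading the pipeline. The following is a minimal sketch, not part of DaCy: the helper names (`parse_version`, `spacy_version_is_supported`, `load_da_dacy_large`) are hypothetical, and it assumes the pipeline has already been installed as a Python package named `da_dacy_large_trf`.

```python
import re
from typing import Tuple

# Bounds copied from the "spaCy" row of the table: >=3.5.2,<3.6.0
MIN_VERSION = (3, 5, 2)  # inclusive lower bound
MAX_VERSION = (3, 6, 0)  # exclusive upper bound


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading 'major.minor.patch' numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def spacy_version_is_supported(version: str) -> bool:
    """Return True if `version` lies inside the range given in the table.

    Simplification: pre-release suffixes such as 'rc1' are ignored.
    """
    return MIN_VERSION <= parse_version(version) < MAX_VERSION


def load_da_dacy_large():
    """Load the pipeline (assumes the package `da_dacy_large_trf` is installed)."""
    import spacy  # imported lazily so the helpers above work without spaCy

    if not spacy_version_is_supported(spacy.__version__):
        raise RuntimeError(
            f"spaCy {spacy.__version__} is outside the range >=3.5.2,<3.6.0"
        )
    return spacy.load("da_dacy_large_trf")


print(spacy_version_is_supported("3.5.4"))
print(spacy_version_is_supported("3.6.0"))
```

With the pipeline installed, `nlp = load_da_dacy_large()` followed by `nlp.pipe_names` should list the components shown in the table.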

### Label Scheme

<details>

<summary>View label scheme (211 labels for 4 components)</summary>

| Component | Labels |
| --- | --- |
| `tagger` | `ADJ`, `ADP`, `ADV`, `AUX`, `CCONJ`, `DET`, `INTJ`, `NOUN`, `NUM`, `PART`, `PRON`, `PROPN`, `PUNCT`, `SCONJ`, `SYM`, `VERB`, `X` |
| `morphologizer` | `AdpType=Prep\|POS=ADP`, `Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=AUX\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=PROPN`, `Definite=Ind\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `POS=SCONJ`, `Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Act`, `POS=ADV`, `Number=Plur\|POS=DET\|PronType=Dem`, `Degree=Pos\|Number=Plur\|POS=ADJ`, `Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `POS=PUNCT`, `NumType=Ord\|POS=ADJ`, `POS=CCONJ`, `Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `POS=VERB\|VerbForm=Inf\|Voice=Act`, `Case=Acc\|Gender=Neut\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Degree=Sup\|POS=ADV`, `Degree=Pos\|POS=ADV`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Number=Plur\|POS=DET\|PronType=Ind`, `POS=ADP`, `POS=ADV\|PartType=Inf`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Mood=Ind\|POS=AUX\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Act`, `POS=ADP\|PartType=Inf`, `Definite=Ind\|Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `NumType=Card\|POS=NUM`, `Degree=Pos\|POS=ADJ`, `Definite=Ind\|Number=Sing\|POS=AUX\|Tense=Past\|VerbForm=Part`, `POS=PART\|PartType=Inf`, `Case=Acc\|POS=PRON\|Person=3\|PronType=Prs\|Reflex=Yes`, `Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Number[psor]=Plur\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs`, `POS=VERB\|Tense=Pres\|VerbForm=Part`, `Case=Nom\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `Case=Gen\|Definite=Def\|Gender=Com\|Number=Sing\|POS=NOUN`, `Definite=Def\|Degree=Sup\|Number=Plur\|POS=ADJ`, `Case=Acc\|Number=Plur\|POS=PRON\|Person=3\|PronType=Prs`, `POS=AUX\|VerbForm=Inf\|Voice=Act`, `Definite=Ind\|Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Definite=Ind\|Degree=Cmp\|Number=Sing\|POS=ADJ`, `Degree=Cmp\|POS=ADJ`, `POS=PRON\|PartType=Inf`, `Definite=Ind\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Case=Nom\|Gender=Com\|POS=PRON\|PronType=Ind`, `Number=Plur\|POS=PRON\|PronType=Ind`, `POS=INTJ`, `Gender=Com\|Number=Sing\|POS=DET\|PronType=Dem`, `Case=Gen\|Number=Plur\|POS=DET\|PronType=Ind`, `Mood=Ind\|POS=VERB\|Tense=Pres\|VerbForm=Fin\|Voice=Pass`, `Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Degree=Cmp\|POS=ADV`, `Number=Plur\|Number[psor]=Plur\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=3\|PronType=Prs`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Case=Gen\|POS=PROPN`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Ind`, `Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Definite=Def\|Degree=Sup\|POS=ADJ`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Ind`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Gender=Neut\|Number=Sing\|POS=DET\|PronType=Dem`, `Definite=Def\|Number=Sing\|POS=VERB\|Tense=Past\|VerbForm=Part`, `POS=PRON\|PronType=Dem`, `Degree=Pos\|Gender=Com\|Number=Sing\|POS=ADJ`, `Number=Plur\|POS=NUM`, `POS=VERB\|VerbForm=Inf\|Voice=Pass`, `Definite=Def\|Degree=Sup\|Number=Sing\|POS=ADJ`, `Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=1\|PronType=Prs`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `POS=PRON`, `Definite=Ind\|Number=Sing\|POS=NOUN`, `Definite=Ind\|Number=Sing\|POS=NUM`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Sing\|POS=NOUN`, `Foreign=Yes\|POS=ADV`, `POS=NOUN`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Sing\|POS=NOUN`, `Gender=Com\|Number=Plur\|POS=NOUN`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Ind`, `Case=Gen\|Definite=Ind\|Gender=Com\|Number=Plur\|POS=NOUN`, `Degree=Pos\|Gender=Neut\|Number=Sing\|POS=ADJ`, `Degree=Sup\|POS=ADJ`, `Degree=Pos\|Number=Sing\|POS=ADJ`, `Mood=Imp\|POS=VERB`, `Case=Nom\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `Case=Acc\|Gender=Com\|POS=PRON\|Person=2\|Polite=Form\|PronType=Prs`, `POS=X`, `Case=Gen\|Definite=Def\|Gender=Com\|Number=Plur\|POS=NOUN`, `Number=Plur\|POS=PRON\|PronType=Dem`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=1\|PronType=Prs`, `Number=Plur\|POS=PRON\|PronType=Int,Rel`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Degree=Cmp\|Number=Plur\|POS=ADJ`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs`, `Gender=Com\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Case=Nom\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Case=Acc\|Gender=Com\|Number=Sing\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Com\|POS=PRON\|PronType=Int,Rel`, `Case=Gen\|Degree=Pos\|Number=Plur\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `POS=VERB\|VerbForm=Ger`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Dem`, `Case=Gen\|POS=PRON\|PronType=Int,Rel`, `Mood=Ind\|POS=VERB\|Tense=Past\|VerbForm=Fin\|Voice=Pass`, `Abbr=Yes\|POS=X`, `Case=Gen\|Definite=Ind\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Definite=Ind\|Number=Plur\|POS=NOUN`, `Foreign=Yes\|POS=X`, `Number=Plur\|POS=PRON\|PronType=Rcp`, `Case=Nom\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Case=Gen\|Degree=Cmp\|POS=ADJ`, `Case=Gen\|Definite=Def\|Gender=Neut\|Number=Plur\|POS=NOUN`, `Case=Acc\|Gender=Com\|Number=Plur\|POS=PRON\|Person=2\|PronType=Prs`, `Gender=Neut\|Number=Sing\|POS=PRON\|PronType=Dem`, `Number=Plur\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Gender=Neut\|Number=Sing\|Number[psor]=Plur\|POS=DET\|Person=1\|Poss=Yes\|PronType=Prs\|Style=Form`, `Number=Plur\|Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs\|Reflex=Yes`, `Number[psor]=Sing\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Number=Plur\|POS=PRON\|PronType=Rcp`, `POS=DET\|Person=2\|Polite=Form\|Poss=Yes\|PronType=Prs`, `POS=SYM`, `POS=DET\|PronType=Dem`, `Gender=Com\|Number=Sing\|POS=NUM`, `Number[psor]=Plur\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Case=Gen\|Number=Plur\|POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Def\|Degree=Abs\|POS=ADJ`, `POS=VERB\|Tense=Pres`, `Definite=Ind\|Gender=Neut\|Number=Sing\|POS=NUM`, `Degree=Abs\|POS=ADV`, `Case=Gen\|Definite=Def\|Degree=Pos\|Number=Sing\|POS=ADJ`, `Gender=Com\|Number=Sing\|POS=PRON\|PronType=Int,Rel`, `POS=VERB\|Tense=Past\|VerbForm=Part`, `Definite=Ind\|Degree=Sup\|Number=Sing\|POS=ADJ`, `Gender=Neut\|Number=Sing\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Gender=Com\|Number=Sing\|Number[psor]=Sing\|POS=PRON\|Person=1\|Poss=Yes\|PronType=Prs`, `Number=Plur\|Number[psor]=Sing\|POS=DET\|Person=2\|Poss=Yes\|PronType=Prs`, `Number[psor]=Plur\|POS=PRON\|Person=3\|Poss=Yes\|PronType=Prs`, `Definite=Ind\|POS=NOUN`, `Case=Gen\|Gender=Com\|Number=Sing\|POS=DET\|PronType=Ind`, `Definite=Ind\|Gender=Com\|Number=Sing\|POS=NUM`, `Definite=Def\|Number=Plur\|POS=NOUN`, `Case=Gen\|POS=NOUN`, `POS=AUX\|Tense=Pres\|VerbForm=Part` |
| `parser` | `ROOT`, `acl:relcl`, `advcl`, `advmod`, `advmod:lmod`, `amod`, `appos`, `aux`, `case`, `cc`, `ccomp`, `compound:prt`, `conj`, `cop`, `dep`, `det`, `expl`, `fixed`, `flat`, `iobj`, `list`, `mark`, `nmod`, `nmod:poss`, `nsubj`, `nummod`, `obj`, `obl`, `obl:lmod`, `obl:tmod`, `punct`, `xcomp` |
| `ner` | `LOC`, `MISC`, `ORG`, `PER` |

</details>
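
Each `morphologizer` label packs several `Feature=Value` pairs into one string separated by pipe characters (the backslashes in the table only escape the pipes for Markdown). The sketch below shows how such a label can be split into a dictionary in plain Python; `parse_morph_label` is a hypothetical helper written for illustration, not a function provided by the pipeline.

```python
from typing import Dict


def parse_morph_label(label: str) -> Dict[str, str]:
    """Split a label such as 'Gender=Com|POS=NOUN' into a feature dictionary."""
    features: Dict[str, str] = {}
    for pair in label.split("|"):
        # Split on the first '=' only; multi-valued features such as
        # 'PronType=Int,Rel' keep their comma-separated value as one string.
        key, value = pair.split("=", 1)
        features[key] = value
    return features


label = "Definite=Ind|Gender=Com|Number=Sing|POS=NOUN"
features = parse_morph_label(label)
print(features["POS"])
print(features)
```

When working with the loaded pipeline, spaCy exposes the same information per token through `token.morph`, so a helper like this is mainly useful for inspecting the label list itself.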

### Accuracy

| Type | Score |
| --- | --- |
| `TOKEN_ACC` | 99.92 |
| `TOKEN_P` | 99.70 |
| `TOKEN_R` | 99.77 |
| `TOKEN_F` | 99.74 |
| `SENTS_P` | 100.00 |
| `SENTS_R` | 100.00 |
| `SENTS_F` | 100.00 |
| `TAG_ACC` | 99.14 |
| `POS_ACC` | 99.08 |
| `MORPH_ACC` | 98.80 |
| `MORPH_MICRO_P` | 99.45 |
| `MORPH_MICRO_R` | 99.32 |
| `MORPH_MICRO_F` | 99.39 |
| `DEP_UAS` | 92.81 |
| `DEP_LAS` | 90.80 |
| `ENTS_P` | 88.58 |
| `ENTS_R` | 86.20 |
| `ENTS_F` | 87.38 |
| `LEMMA_ACC` | 95.89 |
| `COREF_LEA_F1` | 46.72 |
| `COREF_LEA_PRECISION` | 45.91 |
| `COREF_LEA_RECALL` | 47.56 |
| `NEL_SCORE` | 34.29 |
| `NEL_MICRO_P` | 84.00 |
| `NEL_MICRO_R` | 21.54 |
| `NEL_MICRO_F` | 34.29 |
| `NEL_MACRO_P` | 86.71 |
| `NEL_MACRO_R` | 24.70 |
| `NEL_MACRO_F` | 37.28 |
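
The micro-averaged F scores in the table are consistent with the harmonic mean of the corresponding precision and recall. A small sketch that recomputes two of them from the unrounded values in the model card metadata; `f_score` is a hypothetical helper for illustration, not the evaluation code used to produce the table.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded precision and recall taken from the model card metadata.
ner_f = f_score(0.8858195212, 0.8620071685)
nel_f = f_score(0.84, 0.2153846154)

# Express as percentages, matching the table's two-decimal format.
print(f"ENTS_F      = {ner_f * 100:.2f}")
print(f"NEL_MICRO_F = {nel_f * 100:.2f}")
```

The entity linking row shows why the F score is worth reading alongside its parts: precision is high (84.00) but recall is low (21.54), and the harmonic mean is pulled towards the lower of the two.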



### Training
This model was trained using [spaCy](https://spacy.io), and training was logged to [Weights & Biases](https://wandb.ai/kenevoldsen/dacy-v0.2.0), where all the training logs are available.