LLM based Curation
LLMCurate
Bases: BaseCurate
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | AutoModelForCausalLM | Instantiated LLM | required |
tokenizer | AutoTokenizer | Instantiated tokenizer corresponding to the `model` | required |
verbose | bool | Sets the verbosity level during execution. | False |
Examples:
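```python
from dqc.llm import LLMCurate

llmc = LLMCurate(model, tokenizer)
# Keyword arguments are used because `column_to_curate` is the first
# positional parameter of `run`, ahead of `data`.
ds = llmc.run(
    column_to_curate=column_to_curate,
    data=data,
    ds_column_mapping=ds_column_mapping,
    prompt_variants=prompt_variants,
    llm_response_cleaned_column_list=llm_response_cleaned_column_list,
    answer_start_token=answer_start_token,
    answer_end_token=answer_end_token,
    batch_size=batch_size,          # not a named parameter; collected by **options
    max_new_tokens=max_new_tokens,  # not a named parameter; collected by **options
)
```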
where

* `model` and `tokenizer` are the instantiated LLM model and tokenizer objects respectively
* `data` is a pandas dataframe containing samples with our target text for curation under column `column_to_curate`
* `ds_column_mapping` is the dictionary mapping of entities used in the LLM prompt and the corresponding columns in `data`. For example, `ds_column_mapping={'INPUT' : 'input_column'}` would imply that text under `input_column` in `data` would be passed to the LLM in the format `"[INPUT]row['input_column'][/INPUT]"` for each `row` in `data`
* `prompt_variants` is the list of LLM prompts to be used to curate `column_to_curate` and `llm_response_cleaned_column_list` is the corresponding list of column names to store the reference responses generated using each prompt
* `answer_start_token` and `answer_end_token` are optional text phrases representing the start and end of the answer respectively.

`ds` is a dataset object with the following additional features:

1. A feature for each column name in `llm_response_cleaned_column_list`
2. An LLM confidence score for each text in `column_to_curate`
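
The `ds_column_mapping` convention described above can be illustrated with a small standalone sketch. The helper `wrap_entities` is hypothetical and not part of `dqc`; it only mimics the `[ENTITY]text[/ENTITY]` format from the example, and joining multiple entities with newlines is an assumption made here for illustration.

```python
def wrap_entities(row, ds_column_mapping):
    """Wrap each mapped column value in [ENTITY]...[/ENTITY] tags.

    Hypothetical helper for illustration only; not part of dqc.
    """
    parts = []
    for entity, column in ds_column_mapping.items():
        parts.append(f"[{entity}]{row[column]}[/{entity}]")
    # Assumption: multiple entities are joined with newlines.
    return "\n".join(parts)


# A plain dict stands in for one row of `data`.
row = {"input_column": "The movie was great", "label": "positive"}
formatted = wrap_entities(row, {"INPUT": "input_column"})
print(formatted)  # [INPUT]The movie was great[/INPUT]
```

How the library actually combines this formatted text with each entry of `prompt_variants` is not shown here.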
Source code in dqc/llm.py
run(column_to_curate, data=None, ds_column_mapping={}, prompt_variants=[''], skip_llm_inference=False, llm_response_cleaned_column_list=['reference_prediction'], return_scores=True, answer_start_token='', answer_end_token='', scoring_params={'scoring_method': 'exact_match', 'case_sensitive': False}, **options)
Run LLMCurate on the input data
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_to_curate | str | Column name in `data` containing the text to curate | required |
data | Union[DataFrame, Dataset] | Input data for LLM based curation | None |
ds_column_mapping | dict | Mapping of entities to be used in the LLM prompt and the corresponding columns in the input data. | {} |
prompt_variants | List[str] | List of different LLM prompts to be used to curate the labels under `column_to_curate` | [''] |
skip_llm_inference | bool | Indicator variable to prevent re-running LLM inference. Set to `True` to skip inference. | False |
llm_response_cleaned_column_list | list | Names of the columns that will contain LLM predictions for each input prompt in `prompt_variants` | ['reference_prediction'] |
return_scores | bool | Indicator variable; set to `True` to return confidence scores. | True |
answer_start_token | str | Token that indicates the start of answer generation. | '' |
answer_end_token | str | Token that indicates the end of answer generation. | '' |
scoring_params | dict | Parameters passed to the util function that computes the confidence scores | {'scoring_method': 'exact_match', 'case_sensitive': False} |
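
To make the role of `answer_start_token` and `answer_end_token` more concrete, here is a minimal sketch of pulling an answer out of raw generated text. The function `extract_answer` and the `[ANSWER]` markers are hypothetical; the library's own response cleaning may work differently.

```python
def extract_answer(generated_text, answer_start_token="", answer_end_token=""):
    """Return the text between the start and end tokens.

    Hypothetical helper for illustration only; not part of dqc.
    Empty tokens (the defaults) leave that side of the text untouched.
    """
    text = generated_text
    if answer_start_token and answer_start_token in text:
        # Keep everything after the first occurrence of the start token.
        text = text.split(answer_start_token, 1)[1]
    if answer_end_token and answer_end_token in text:
        # Drop everything from the first occurrence of the end token onward.
        text = text.split(answer_end_token, 1)[0]
    return text.strip()


raw = "Prompt text [ANSWER] positive [/ANSWER] trailing tokens"
answer = extract_answer(raw, "[ANSWER]", "[/ANSWER]")
print(answer)  # positive
```

With both tokens left at their default of `''`, this sketch simply returns the stripped text.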
Returns:
Name | Type | Description |
---|---|---|
Dataset | Dataset | Input dataset with reference responses. If `return_scores` is `True`, confidence scores are included as well. |
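
The default `scoring_params` name an `exact_match` method with `case_sensitive` set to `False`. One plausible reading, sketched below, is a score equal to the fraction of reference responses that match the text being curated. The function `exact_match_confidence` is hypothetical, and the actual scoring in `dqc` may be computed differently.

```python
def exact_match_confidence(label, reference_responses, case_sensitive=False):
    """Fraction of reference responses that exactly match the label.

    Hypothetical helper for illustration only; not part of dqc.
    """
    if not reference_responses:
        return 0.0

    def normalize(text):
        text = text.strip()
        return text if case_sensitive else text.lower()

    target = normalize(label)
    matches = sum(normalize(r) == target for r in reference_responses)
    return matches / len(reference_responses)


# Three responses, as if produced by three entries in `prompt_variants`.
score = exact_match_confidence("Positive", ["positive", "POSITIVE", "negative"])
print(round(score, 3))  # 0.667
```

Under this reading, more prompt variants give a finer-grained score, since each variant contributes one vote.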
Source code in dqc/llm.py