Tutorial¶
Practical examples
We always start by importing HGC:
In [1]: import pandas as pd
In [2]: import hgc
Creating a HGC-enabled DataFrame¶
A hydrochemical groundwater analysis typically starts with a ‘normal’ Pandas DataFrame, in which each row contains a groundwater quality sample, and each column represents a water quality parameter. The DataFrame may contain concentrations of different chemical compounds, possibly exceeding the detection limit (denoted with a ‘<’ or ‘>’ prefix). There may also be errors in the data, such as negative concentrations or text placeholders.
Note
Please refer to the excellent WaDI package to get your excel or csv file with measurements in a format that HGC understands. In this tutorial, we create our own DataFrame for clarity.
In [3]: testdata = {'alkalinity': [0.0], 'Al': [2600], 'Ba': [44.0],
...: 'Br': [0.0], 'Ca': [2.0], 'Cl': [19.0],
...: 'Co': [1.2], 'Cu': [4.0], 'doc': [4.4],
...: 'F': [0.08], 'Fe': [0.29], 'K': [1.1],
...: 'Li': [5.0], 'Mg': [1.6], 'Mn': ['< 0.05'],
...: 'Na': [15.0], 'Ni': [7.0], 'NH4': ['< 0.05'],
...: 'NO2': [0.0], 'NO3': [22.6], 'Pb': [2.7],
...: 'PO4': ['0.04'], 'ph': [4.3], 'SO4': [16.0],
...: 'Sr': [50], 'Zn': [60.0] }
...:
In [4]: df = pd.DataFrame.from_dict(testdata)
In [5]: df
Out[5]:
alkalinity Al Ba Br Ca Cl ... Pb PO4 ph SO4 Sr Zn
0 0.0 2600 44.0 0.0 2.0 19.0 ... 2.7 0.04 4.3 16.0 50 60.0
[1 rows x 26 columns]
Since the data in this DataFrame is messy, we cannot use it yet for hydrochemical calculations. HGC can check if the data contains obvious errors:
In [6]: is_valid = df.hgc.is_valid
In [7]: is_valid
Out[7]: False
The DataFrame may contain any kind of columns and column names. However, HGC will only recognize a specific set of columns with names of hydrochemical parameters.
In [8]: from hgc.constants import constants
In [9]: print([*constants.atoms])
['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr', 'Rf', 'Db', 'Sg', 'Bh', 'Hs', 'Mt', 'Ds', 'Rg', 'Cn', 'Nh', 'Fl', 'Mc', 'Lv', 'Ts', 'Og']
In [10]: print([*constants.ions])
['CH4', 'H2S', 'S', 'CO2', 'alkalinity', 'O2_field', 'O2_lab', 'O2', 'KMnO4', 'NH4', 'NO2', 'NO3', 'N_kj', 'N', 'PO4', 'PO4_ortho', 'P', 'SiO2', 'SO4_ic', 'SO4', 'doc', 'toc', 'cod']
In [11]: print([*constants.properties])
['ec_field', 'ec_lab', 'ec', 'ph_field', 'ph_lab', 'ph', 'temp_field', 'temp_lab', 'temp', 'eh_field', 'turb', 'uva254']
You can also retreive the details of each compound, such as the expected units, full name or molar weight:
In [12]: constants.atoms['H']
Out[12]: Atom(feature='H', name='Hydrogen', unit='mg/L', mw=1.00794, oxidized=1.0, reduced=0.0, SMOW=0.0)
In [13]: constants.properties['ec']
Out[13]: Properties(feature='ec', name='EC converted to 20°C', example='read', unit='μS/cm', phreeq_name=None)
For your convenience, all units for all allowed (columns with) atoms, ions and properties are enlisted here here.
Since in this case our DataFrame contains negative concentrations, detection limits (rows with ‘<’ or ‘>’) and incorrect data types (e.g. string columns that are supposed to be numeric), HGC will initially report that the DataFrame is invalid. HGC can automatically solve inconsistencies with the ‘make_valid’ method. As a result, negative concentrations are replaced by 0; concentrations below detection limit are replaced by half the limit; concentrations above the upper detection limit are replaced by 1.5 times that limit.
In [14]: df.hgc.make_valid()
In [15]: is_valid = df.hgc.is_valid
In [16]: is_valid
Out[16]: True
In [17]: df
Out[17]:
alkalinity Al Ba Br Ca Cl ... Pb PO4 ph SO4 Sr Zn
0 0.0 2600 44.0 0.0 2.0 19.0 ... 2.7 0.04 4.3 16.0 50 60.0
[1 rows x 26 columns]
# Recognized HGC columns
In [18]: hgc_cols = df.hgc.hgc_cols
In [19]: print(hgc_cols)
['Li', 'F', 'Na', 'Mg', 'Al', 'Cl', 'K', 'Ca', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Br', 'Sr', 'Ba', 'Pb', 'alkalinity', 'NH4', 'NO2', 'NO3', 'PO4', 'SO4', 'doc', 'ph']
Calculations¶
Now that our DataFrame is valid, we can use all HGC methods, such as calculating the Base Exchange Index of each row; this is added as column to df:
In [20]: df.hgc.get_bex()
In [21]: df.bex
Out[21]:
0 0.238022
Name: bex, dtype: float64
We can also classify each sample into the Stuyfzand water type:
In [22]: df.hgc.get_stuyfzand_water_type()
In [23]: df.water_type
Out[23]:
0 g*NaNO3o
Name: water_type, dtype: object
Or get the sum of all anions (using the Stuyfzand method):
In [24]: df.hgc.get_sum_anions()
In [25]: df.sum_anions
Out[25]:
0 1.282127
Name: sum_anions, dtype: float64
It is also possible to compute common hydrochemical ratios between different compounds. HGC calculates ratios for all columns that are available and ignores any missing columns.
In [26]: df.hgc.get_ratios()
In [27]: df.cl_to_na
Out[27]:
0 0.82138
Name: cl_to_na, dtype: float64
For all these above mentioned get functions, the columns are added to the dataframe. Most of the times this is convenient, but there are also cases where you don’t want to add them to the DataFrame but only want to return the result. In that case, one could use the inplace argument; this works the same as native pandas methods that have this argument With inplace=True (the default), the columns are added to the DataFrame (as shown in the examples above). With inplace=False the columns are not added to the database but returned as a pandas Series or DataFrame. E.g., for the Stuyfzand water type (a Series) or ratios (a DataFrame):
In [28]: water_type = df.hgc.get_stuyfzand_water_type(inplace=False)
In [29]: water_type
Out[29]:
0 g*NaNO3o
Name: swt, dtype: object
In [30]: ratios = df.hgc.get_ratios(inplace=False)
In [31]: ratios
Out[31]:
cl_to_br cl_to_na ... hco3_to_sum_anions hco3_to_ca_and_mg
0 inf 0.82138 ... 0.0 0.0
[1 rows x 8 columns]
Consolidation¶
A common situation is that one single parameter of a sample is measured with several methods or in different places. Parameters such as EC and pH are frequently measured both in the lab and field, and SO4 and PO4 are frequently measured both by IC and ICP-OES. Normally we prefer the field data for EC and pH, but ill calibrated sensors or tough field circumstances may prevent these readings to be superior to the lab measurement. In such cases we want select from multiple columns the one to use for subsequent calculations, by consolidating into one single column containing the best measurements, possibly filling gaps with measurements from the inferior method. Let’s consider this example:
In [32]: testdata = {
....: 'ph_lab': [4.3, 6.3, 5.4], 'ph_field': [4.4, 6.1, 5.7],
....: 'ec_lab': [304, 401, 340], 'ec_field': [290, 'error', 334.6],
....: }
....:
In [33]: df = pd.DataFrame.from_dict(testdata)
In [34]: df
Out[34]:
ph_lab ph_field ec_lab ec_field
0 4.3 4.4 304 290
1 6.3 6.1 401 error
2 5.4 5.7 340 334.6
In [35]: df.hgc.make_valid()
In [36]: df
Out[36]:
ph_lab ph_field ec_lab ec_field
0 4.3 4.4 304 290.0
1 6.3 6.1 401 NaN
2 5.4 5.7 340 334.6
In [37]: df.hgc.consolidate(use_ph='field', use_ec='lab', use_temp=None,
....: use_so4=None, use_o2=None)
....:
In [38]: df
Out[38]:
ph ec
0 4.4 304.0
1 6.1 401.0
2 5.7 340.0
Warning
Note that omitting use_so4=None
in the function call, would let the function
fall back to the default which is ic
. Because the column so4_ic
is not in the dataframe
this will return an error. The same holds for use_temp
and use_o2
.
In [39]: df.hgc.consolidate(use_ph='field', use_ec='lab', use_temp=None,)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-2cbb7cd1e5e6> in <module>
----> 1 df.hgc.consolidate(use_ph='field', use_ec='lab', use_temp=None,)
~/checkouts/readthedocs.org/user_builds/hgc/checkouts/latest/hgc/samples_frame.py in consolidate(self, use_ph, use_ec, use_so4, use_o2, use_temp, use_alkalinity, merge_on_na, inplace)
346 raise ValueError(f"Column {source} not present in DataFrame. Use " +
347 f"use_{param.lower()}=None to explicitly ignore consolidating " +
--> 348 f"this column.")
349
350
ValueError: Column SO4_ic not present in DataFrame. Use use_so4=None to explicitly ignore consolidating this column.
Visualizing and exporting¶
The great thing about HGC is that your DataFrame gets hydrochemical superpowers, yet all functions that you expect from a regular Pandas DataFrame are still available, allowing you to easily import/export and visualize data.
In [40]: df.std()
Out[40]:
ph 0.888819
ec 49.034002
dtype: float64
In [41]: df.plot()
Out[41]: <AxesSubplot:>
Coupling to PHREEQC¶
Another great superpower of HGC is that it allows easy geochemistry directly on your dataframe! It currently has coupling with the popular geochemistry software PHREEQC via its python wrappers as implemented by the phreeqpython package.
Let’s extend the above DataFrame a little to make it more meaningful in the context of this coupling:
In [42]: testdata = {
....: 'ph_lab': [4.5, 5.5, 7.6], 'ph_field': [4.4, 6.1, 7.7],
....: 'ec_lab': [304, 401, 340], 'ec_field': [290, 'error', 334.6],
....: 'temp': [10, 10, 10],
....: 'alkalinity': [0, 7, 121],
....: 'O2': [11, 0, 0],
....: 'Na': [9,20,31], 'K':[0.4, 2.1, 2.0],
....: 'Ca':[1,3,47],
....: 'Fe': [0.10, 2.33, 0.4],
....: 'Mn': [0.02, 0.06, 0.13],
....: 'NH4': [1.29, 0.08, 0.34],
....: 'SiO2': [0.2, 15.4, 13.3],
....: 'SO4': [7,19,35],
....: 'NO3': [3.4,0.1,0],
....: }
....:
In [43]: df = pd.DataFrame.from_dict(testdata)
In [44]: df.hgc.make_valid()
In [45]: df.hgc.consolidate(use_ph='lab', use_ec='lab', use_temp=None,
....: use_so4=None, use_o2=None)
....:
With this DataFrame, we can do some PHREEQC calculations. For example, we can calculate the saturation index of different minerals like Calcite:
In [46]: df.hgc.get_saturation_index('Calcite')
In [47]: df['si_calcite'] # or df.si_calcite
Out[47]:
0 -999.000000
1 -4.722956
2 -0.288641
Name: si_calcite, dtype: float64
The mineral name will be added as a column named si_<mineral_name> where <mineral_name> is the name of the mineral as given to PHREEQC but all letters in lower case (and don’t forget the underscore). The saturation index (SI) of a mineral can only be retrieved if they are defined in the phreeqc database used by phreeqpython. If the mineral is not defined, always an SI of -999 will be returned.
This also works for the partial pressure of gasses (because in PhreeqC both minerals and gasses are defined as PHASES; see below for explanation of the coupling to PhreeqC). But it looks better if one uses the alias partial_pressure which returns the same values but with a different name of the column (prepending pp instead of si, since it is the partial pressure and not the saturation index).
In [48]: df.hgc.get_saturation_index('CO2(g)')
In [49]: df['si_co2(g)']
Out[49]:
0 -999.000000
1 -1.715479
2 -2.609626
Name: si_co2(g), dtype: float64
In [50]: df.hgc.get_partial_pressure('CO2(g)')
In [51]: df['pp_co2(g)']
Out[51]:
0 -999.000000
1 -1.715479
2 -2.609626
Name: pp_co2(g), dtype: float64
Similar to the SI, the specific conductance (SC), also known as electric conductance (EC) or EGV, is simply retrieved by calling:
In [52]: df.hgc.get_specific_conductance()
In [53]: df.sc
Out[53]:
0 37.193926
1 64.734047
2 218.187466
Name: sc, dtype: float64
Internally, these methods call the method get_phreeqpython_solutions to retrieve instances of the PhreeqPython Solution class. PhreeqPython is a Python package that allows the use of the Geochemical modeling package PhreeqC from within Python. HGC leverages this package to have a PhreeqC solution (or actually a PhreeqPython solution) for every row of the SamplesFrame. These are available to the user by calling
In [54]: df.hgc.get_phreeqpython_solutions()
In [55]: df.pp_solutions
Out[55]: ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/IPython/lib/pretty.py in pretty(self, obj)
392 if cls is not object \
393 and callable(cls.__dict__.get('__repr__')):
--> 394 return _repr_pprint(obj, self, cycle)
395
396 return _default_pprint(obj, self, cycle)
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
698 """A pprint that just redirects to the normal repr function."""
699 # Find newlines and replace them with p.break_()
--> 700 output = repr(obj)
701 lines = output.splitlines()
702 with p.group():
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/core/series.py in __repr__(self)
1469 min_rows=min_rows,
1470 max_rows=max_rows,
-> 1471 length=show_dimensions,
1472 )
1473 return buf.getvalue()
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/core/series.py in to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
1532 max_rows=max_rows,
1533 )
-> 1534 result = formatter.to_string()
1535
1536 # catch contract violations
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in to_string(self)
389
390 fmt_index, have_header = self._get_formatted_index()
--> 391 fmt_values = self._get_formatted_values()
392
393 if self.is_truncated_vertically:
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in _get_formatted_values(self)
378 float_format=self.float_format,
379 na_rep=self.na_rep,
--> 380 leading_space=self.index,
381 )
382
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
1238 )
1239
-> 1240 return fmt_obj.get_result()
1241
1242
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in get_result(self)
1269
1270 def get_result(self) -> list[str]:
-> 1271 fmt_values = self._format_strings()
1272 return _make_fixed_width(fmt_values, self.justify)
1273
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in _format_strings(self)
1332 for i, v in enumerate(vals):
1333 if not is_float_type[i] and leading_space:
-> 1334 fmt_values.append(f" {_format(v)}")
1335 elif is_float_type[i]:
1336 fmt_values.append(float_format(v))
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/format.py in _format(x)
1312 else:
1313 # object dtype
-> 1314 return str(formatter(x))
1315
1316 vals = extract_array(self.values, extract_numpy=True)
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/printing.py in pprint_thing(thing, _nest_lvl, escape_chars, default_escapes, quote_strings, max_seq_items)
231 result = f"'{as_escaped_string(thing)}'"
232 else:
--> 233 result = as_escaped_string(thing)
234
235 return result
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/pandas/io/formats/printing.py in as_escaped_string(thing, escape_chars)
207 escape_chars = escape_chars or ()
208
--> 209 result = str(thing)
210 for c in escape_chars:
211 result = result.replace(c, translate[c])
~/checkouts/readthedocs.org/user_builds/hgc/envs/latest/lib/python3.7/site-packages/phreeqpython/solution.py in __str__(self)
239 # pretty printing
240 def __str__(self):
--> 241 return "<PhreeqPython.Solution."+self.__class__.__name__ + " with number '" + self.number + "'>"
TypeError: can only concatenate str (not "int") to str
Because all elements in this column are PhreeqPython Solution’s, PhreeqC can be used to calculate all kind of properties of each water sample of each row in the SamplesFrame. In the documentation of PhreeqPython all these are described. For example, one can derive the specific conductance or pH from the first sample:
In [56]: df.pp_solutions[0].sc
Out[56]: 37.19392619168676
In [57]: df.pp_solutions[0].pH
Out[57]: 4.5
or for all the samples:
In [58]: [s.sc for s in df.pp_solutions]
Out[58]: [37.19392619168676, 64.73404717149515, 218.18746625402372]
Note that these are the exact same results as above when df.hgc.get_specific_conductance() was called.
But also more involved operations are possible, for example, inspecting the speciation of the first sample in the original SamplesFrame df:
In [59]: df.pp_solutions[0].species
Out[59]:
{'Amm': 4.107427346238537e-10,
'AmmH+': 7.145332058236736e-05,
'AmmHSO4-': 6.038481976861183e-08,
'Ca+2': 2.4719556914911714e-05,
'CaHSO4+': 4.107750059503443e-11,
'CaOH+': 1.206793597413229e-13,
'CaSO4': 2.3258121349793236e-07,
'Fe(OH)2': 3.377294151086896e-19,
'Fe(OH)2+': 3.0790467275285987e-13,
'Fe(OH)3': 6.15008335198951e-16,
'Fe(OH)3-': 3.481413027742945e-25,
'Fe(OH)4-': 9.631965209582988e-21,
'Fe(SO4)2-': 4.549314587256489e-19,
'Fe+2': 1.7766173469538303e-06,
'Fe+3': 8.017708354854963e-16,
'Fe2(OH)2+4': 2.084732141128855e-25,
'Fe3(OH)4+5': 7.00155604671684e-35,
'FeHSO4+': 2.9536979682997056e-12,
'FeHSO4+2': 3.2081256063657047e-20,
'FeOH+': 5.078860843975143e-12,
'FeOH+2': 5.756579318382429e-14,
'FeSO4': 1.4104744825322718e-08,
'FeSO4+': 3.3701491576993005e-16,
'H+': 3.237684629013282e-05,
'H2': 8.284273096100859e-21,
'H2O': 55.50929780738274,
'H2SiO4-2': 7.095512381986744e-21,
'H3SiO4-': 8.93438650865223e-12,
'H4SiO4': 3.3288309795364082e-06,
'HSO4-': 1.5224767993236832e-07,
'K+': 1.022746595129961e-05,
'KSO4-': 3.578529154040609e-09,
'Mn(NO3)2': 3.873920247618094e-15,
'Mn(OH)3-': 1.6845954285204035e-28,
'Mn+2': 3.612266976206157e-07,
'Mn+3': 1.2482278567749458e-29,
'MnOH+': 7.539734502001295e-14,
'MnSO4': 2.8321415294557053e-09,
'NO3-': 5.483700575514323e-05,
'Na+': 0.0003913712170485659,
'NaOH': 3.552855616723773e-24,
'NaSO4-': 1.1642908789175805e-07,
'O2': 0.00034378298237867114,
'OH-': 9.532511716174231e-11,
'SO4-2': 7.229439162542111e-05}
Note that units of these speciation calculations are in mmol/L.
One could even manipulate the solution by letting for example calcite precipitate and see how this changes pH
In [60]: desaturated_solutions = [s.desaturate('Calcite') for s in df.pp_solutions]
In [61]: pd.DataFrame(dict(
....: original=df.ph,
....: desaturated=[s.pH for s in desaturated_solutions],)
....: ).round(2)
....:
Out[61]:
original desaturated
0 4.5 4.48
1 5.5 5.47
2 7.6 7.60
For more examples, please visit the examples on the Github page of PhreeqPython.