How can I label a column of strings into numbered groups based on another column containing substrings?

How can I label a column of strings into numbered groups based on another column containing substrings?

Problem Description:

I have the 1st column that is around 4920 different chemical compounds.

For example:

0              Ag(AuS)2      
1            Ag(W3Br7)2      
2      Ag0.5Ge1Pb1.75S4     
3     Ag0.5Ge1Pb1.75Se4     
4                Ag2BBr      
...                 ...      
4916             ZrTaN3     
4917               ZrTe      
4918             ZrTi2O      
4919             ZrTiF6      
4920               ZrW2  

I have the 2nd column that has all the elements of the periodic table numerically listed atomic number

0      H
1     He
2     Li
3     Be
4      B
..   ...
113   Fl
114  Uup
115   Lv
116  Uus
117  Uuo

How can I classify the first column into groups based on the compound’s first element corresponding to their atomic number from column 2 so that I can return the first column

The atomic number of Ag = 27
The atomic number of Zr = 40

    0            47      
    1            47      
    2            47     
    3            47    
    4            47      
    ...                 ...      
    4916         40    
    4917         40       
    4918         40         
    4919         40        
    4920         40     

Solution – 1

Since the first element could be a varying number of letters, the simplest solution would be to use the regex approach for getting the first section.
For example:

import re

compounds = ["Ag(AuS)2", "HTiF", "ZrTaN3"]

for compound in compounds:
    match = re.match(r"[A-Z][a-z]*", compound)
    if match:
        fist_element = match.group(0)
        print(fist_element)

this will print out the first element of each compound.
Note: If there are some more complex compounds and you need to adjust your regex, I recommend using https://regex101.com/ as a playground.

Once you have that information it just needs to be connected with the element in the second column which would be easiest if you mapped that column to a dictionary resembling:

{ H: 0, He: 1, Li: 2 ...}

which would allow you to simply get the element index by calling dict_with_elements.get(first_element).

From there on, the rest is just looping and writing data. I hope this helps.

Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject