Parsing an HTML Table with pd.read_html Where Cells Contain Full Tables Themselves
I need to parse a table from HTML that has other tables nested within the larger table. As called below with pd.read_html, each of these nested tables is parsed and then 'inserted
Solution 1:
You can't use read_html to read nested tables directly, but you can roll your own HTML reader and use read_html for the individual table cells:
from io import StringIO

import bs4
import pandas as pd

with open('up_pf00344.test.html') as f:
    html = f.read()

soup = bs4.BeautifulSoup(html, 'lxml')
results = soup.find(attrs={'id': 'results'})

# get first visible header row as dataframe headers
for row in results.thead.find_all('tr'):
    if 'display:none' not in row.get('style', ''):
        df = pd.DataFrame(columns=[col.get_text() for col in row.find_all('th')])
        break

# append all visible table rows to the dataframe;
# recursive=False keeps rows and cells of nested tables out of the outer loop
for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue
    df_row = []
    for col in row.find_all('td', recursive=False):
        table = col.find_all('table')
        # cells holding a nested table become a DataFrame, all others plain text;
        # StringIO avoids the warning newer pandas raises for literal HTML strings
        df_row.append(pd.read_html(StringIO(str(col)))[0] if table else col.get_text())
    df.loc[len(df)] = df_row
Result of df.iloc[0].map(type):
<class 'str'>
Entry <class 'str'>
Organism <class 'str'>
Protein names <class 'str'>
Gene names <class 'str'>
Length <class 'str'>
Cross-reference (Pfam) <class 'str'>
Cross-reference (InterPro) <class 'str'>
Taxonomic lineage IDs <class 'str'>
Subcellular location [CC] <class 'str'>
Signal peptide <class 'str'>
Transit peptide <class 'str'>
Topological domain <class 'pandas.core.frame.DataFrame'>
Transmembrane <class 'pandas.core.frame.DataFrame'>
Intramembrane <class 'pandas.core.frame.DataFrame'>
Sequence caution <class 'str'>
Caution <class 'str'>
Taxonomic lineage (SUPERKINGDOM) <class 'str'>
Taxonomic lineage (KINGDOM) <class 'str'>
Taxonomic lineage (PHYLUM) <class 'str'>
Cross-reference (RefSeq) <class 'str'>
Cross-reference (EMBL) <class 'str'>
e <class 'str'>
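To try the approach without the original file, here is a minimal sketch on an inline HTML snippet. Only the table id results comes from the answer above; the column subset, the row id and the cell values are invented for illustration, and lxml is assumed to be installed.

```python
from io import StringIO

import bs4
import pandas as pd

# Hypothetical miniature of the page: one hidden header row, one visible
# header row, and a body row whose second cell holds a nested table.
HTML = """
<table id="results">
  <thead>
    <tr style="display:none"><th>hidden</th><th>hidden</th></tr>
    <tr><th>Entry</th><th>Transmembrane</th></tr>
  </thead>
  <tbody>
    <tr id="row1">
      <td>P12345</td>
      <td><table><tr><th>start</th><th>end</th></tr><tr><td>10</td><td>30</td></tr></table></td>
    </tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "lxml")
results = soup.find(attrs={"id": "results"})

# The first header row that is not hidden supplies the column names.
header = next(
    row for row in results.thead.find_all("tr")
    if "display:none" not in row.get("style", "")
)
df = pd.DataFrame(columns=[th.get_text() for th in header.find_all("th")])

# Walk only the direct rows and cells of the outer table.
for row in results.tbody.find_all("tr", recursive=False):
    cells = []
    for col in row.find_all("td", recursive=False):
        if col.find("table"):
            # Nested table: let read_html turn the cell into a DataFrame.
            cells.append(pd.read_html(StringIO(str(col)))[0])
        else:
            cells.append(col.get_text())
    df.loc[len(df)] = cells

nested = df["Transmembrane"].iloc[0]
print(df.iloc[0].map(type))
print(nested)
```

The Entry cell stays a plain string while the Transmembrane cell holds its own DataFrame, which matches the mix of types in the listing above.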
Bonus: As your table rows have an id, you could use it as the index of your dataframe with df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
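The id-as-index variant might look like the following sketch. The two rows, their ids and the segment values are hypothetical, and it assumes every body row carries an id attribute, because row.get('id') returns None otherwise.

```python
from io import StringIO

import bs4
import pandas as pd

# Hypothetical two-row table; each row's id doubles as its index label.
HTML = """
<table id="results">
  <thead><tr><th>Entry</th><th>Transmembrane</th></tr></thead>
  <tbody>
    <tr id="P12345">
      <td>P12345</td>
      <td><table><tr><th>start</th><th>end</th></tr><tr><td>10</td><td>30</td></tr></table></td>
    </tr>
    <tr id="Q67890">
      <td>Q67890</td>
      <td><table><tr><th>start</th><th>end</th></tr><tr><td>5</td><td>25</td></tr><tr><td>40</td><td>60</td></tr></table></td>
    </tr>
  </tbody>
</table>
"""

results = bs4.BeautifulSoup(HTML, "lxml").find(attrs={"id": "results"})
df = pd.DataFrame(columns=[th.get_text() for th in results.thead.tr.find_all("th")])

for row in results.tbody.find_all("tr", recursive=False):
    df_row = []
    for col in row.find_all("td", recursive=False):
        if col.find("table"):
            df_row.append(pd.read_html(StringIO(str(col)))[0])
        else:
            df_row.append(col.get_text())
    # Label the new row with the <tr> id instead of a running number.
    df.loc[row.get("id")] = df_row

# A nested table can then be looked up by row id and column name.
segments = df.at["Q67890", "Transmembrane"]
print(segments)
```

With the id as index, a single nested table is reachable by label rather than by position, which is convenient when rows are filtered or reordered later.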