Reading, Writing and Controlling a Dynamically Instantiated HTML Web Table Using Python Selenium


Assume there is a search page for goods, and I search for 'Teddy'. There are 140 results in total, displayed in a small scrollable table built from <div> elements (one row per item, one column per piece of item information). The table shows up to 5 items at a time (each row is 40px high); to see more, I need to scroll down the table.

The HTML looks like the following when I am viewing the 45th to 49th items (the 45th item is at the top of the current view).

<div class="table-body" style="height:200px">            // This contains the scrollbar
    <div class="table-panel" style="height:5600px">
        <div class="ag-row" style="height:40px" row="42"> // This is one row of goods
            <div class="name">Teddy</div>                // This is one column of the row
            <div class="price">200</div>
            <input class="amount" value="0">             // Input text box for the amount of goods to buy
        </div>
        <div class="ag-row" style="height:40px" row="43">
            <div class="name">Brown Bess</div>
            <div class="price">230</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="44"> // <-- This is what I am seeing at the top. The row attribute is 0-based
            <div class="name">Blue</div>
            <div class="price">280</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="45">
            <div class="name">Scientist</div>
            <div class="price">400</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="46">
            <div class="name">Mouse</div>
            <div class="price">120</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="47">
            <div class="name">Hangover</div>
            <div class="price">150</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="48"> // <-- This is the last row I am seeing
            <div class="name">Building</div>
            <div class="price">420</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="50">
            <div class="name">Park</div>
            <div class="price">60</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="51">
            <div class="name">Coffee</div>
            <div class="price">160</div>
            <input class="amount" value="0">
        </div>
        <div class="ag-row" style="height:40px" row="49">
            <div class="name">Juice</div>
            <div class="price">100</div>
            <input class="amount" value="0">
        </div>
    </div>
</div>

This is simplified, imaginary code; the real code is much more complicated because of its styles, attributes and scripts. I think it is enough to illustrate my question.

I checked the behavior of this web page. It only creates HTML for rows near where I am looking. When I look near the 100th item, it creates HTML for roughly the 92nd to 108th items; how many rows are instantiated is fairly random. When I scroll down or up, it removes rows far from the current position and creates new ones for the current screen.
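To illustrate the consequence of this behavior, here is a minimal sketch in plain Python (no browser involved) of how partial, overlapping snapshots of the rendered rows could be merged into one ordered list. The function name `merge_windows` and the `(row, name, price)` tuple layout are assumptions made for this illustration; they are not part of the page or of Selenium.

```python
def merge_windows(windows):
    """Merge overlapping snapshots of rendered rows into one list ordered by row index.

    Each snapshot is an iterable of (row, name, price) tuples, where `row` is the
    0-based value of the row attribute. The first occurrence of a row is kept;
    later duplicates are ignored.
    """
    merged = {}
    for window in windows:
        for row, name, price in window:
            merged.setdefault(row, (name, price))
    return [(row, *merged[row]) for row in sorted(merged)]


# Two snapshots that overlap on rows 44 and 45, as can happen after a small scroll.
# The second one is deliberately out of order, like the rows in the HTML above.
first = [(42, "Teddy", 200), (43, "Brown Bess", 230), (44, "Blue", 280), (45, "Scientist", 400)]
second = [(45, "Scientist", 400), (44, "Blue", 280), (46, "Mouse", 120)]
rows = merge_windows([first, second])
print([r[0] for r in rows])  # [42, 43, 44, 45, 46]
```

Keying on the row attribute rather than on DOM position matters here, because the rendered rows are not guaranteed to appear in order.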

I need to parse that data into a list-like data structure in Python. Because the page instantiates only partial data depending on the screen (more precisely, it seems to use the scrollbar position to determine what I am looking at), I tried to control the scrollbar, collect all the data from the HTML and remove duplicates. The code is below.

from selenium import webdriver
from selenium.webdriver.common.by import By

def iterateOptionTable(driver):
    el_viewport = driver.find_element(By.CLASS_NAME, 'table-body')
    # Start from the top of the scrollable container
    driver.execute_script('arguments[0].scrollTop = 0;', el_viewport)
    max_height = int(driver.execute_script('return arguments[0].scrollHeight;', el_viewport))
    scrolling_amnt = 40 * 5  # Each row height is 40, five rows per view
    cur_scroll = 0
    seen = set()  # Row numbers that were already yielded
    while cur_scroll < max_height:
        ret = []
        el_products = el_viewport.find_elements(By.XPATH, './div/*')
        for el_p in el_products:
            rownum = int(el_p.get_attribute("row"))
            if rownum not in seen:
                seen.add(rownum)
                ret.append(el_p)
        yield ret   # List of WebElement of goods not seen before
        cur_scroll += scrolling_amnt
        driver.execute_script('arguments[0].scrollTop = arguments[1];', el_viewport, cur_scroll)

def parseElementToData(elems):
    ret = []
    for el in elems:
        single_data = DO_EXTRACT_DATA_FROM_EL(el)  # Placeholder for the actual extraction
        ret.append(single_data)
    return ret

def parseTable(driver):
    ret = []
    for elems in iterateOptionTable(driver):
        ret += parseElementToData(elems)
    return ret

There are several other jobs for the page; it is written with yield because of the webpage hierarchy.

It works pretty well in the debugger when I execute it step by step. But at real runtime, it does not even scroll the table down. Not to mention that I think it is inefficient. I also tried the same approach in JavaScript by executing a script from Selenium.

Is there a more sophisticated way, or can I get an answer as to why this does not work in a normal run? I'm quite new to web crawling and Selenium. Please help :)



I failed to do what I intended. Scrolling is not reliably interactable under these conditions. I managed to solve this by selecting a single cell in the table and sending 'Keys.DOWN' to scroll down.
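The key-press approach can be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact code that was used: the function names, the `patience` stopping rule, and the selectors (taken from the imaginary HTML in the question) are all hypothetical, and it assumes that pressing DOWN in a focused cell moves the table down by one row. The collection loop is kept free of Selenium so it can be tried against a simulated table; `collect_with_selenium` shows one possible wiring.

```python
def collect_rows_by_key_presses(read_visible, press_down, max_presses=1000, patience=3):
    """Collect rows from a virtualized table by stepping down one row at a time.

    read_visible() returns {row_index: data} for the rows currently rendered.
    press_down() moves the selection down by one row.
    Stops after `patience` consecutive presses reveal no new row.
    """
    collected = dict(read_visible())
    idle = 0
    for _ in range(max_presses):
        if idle >= patience:
            break
        press_down()
        before = len(collected)
        for row, data in read_visible().items():
            collected.setdefault(row, data)
        idle = idle + 1 if len(collected) == before else 0
    return [collected[row] for row in sorted(collected)]


def collect_with_selenium(driver):
    """Hypothetical wiring; class names follow the question's imaginary HTML."""
    # Imported here so the simulation below runs without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    body = driver.find_element(By.CLASS_NAME, "table-body")
    body.find_element(By.CSS_SELECTOR, ".ag-row .amount").click()  # focus one cell

    def read_visible():
        rows = {}
        for el in body.find_elements(By.CLASS_NAME, "ag-row"):
            rows[int(el.get_attribute("row"))] = el.find_element(By.CLASS_NAME, "name").text
        return rows

    def press_down():
        driver.switch_to.active_element.send_keys(Keys.DOWN)

    return collect_rows_by_key_presses(read_visible, press_down)


class FakeTable:
    """Simulated table that renders only `window` rows starting at `top`."""

    def __init__(self, total, window=5):
        self.total, self.window, self.top = total, window, 0

    def read(self):
        last = min(self.top + self.window, self.total)
        return {i: "item%d" % i for i in range(self.top, last)}

    def down(self):
        if self.top + self.window < self.total:
            self.top += 1


table = FakeTable(total=12)
items = collect_rows_by_key_presses(table.read, table.down)
print(len(items))  # 12
```

On a real page, rows may be removed between locating them and reading them, so `read_visible` might additionally need to handle stale element errors; that is left out of this sketch.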