Notes on App-Surface Data Collection

2026-05-31

Some product data is easy to fetch from a web endpoint. Some of it only exists on the app screen. The interesting design problem is not just “how do I scrape it”, but how to build a small system that can combine both surfaces without becoming fragile.

That mobile-only part changes the architecture. If the signal is only visible in the app, the collector needs real Android workers. ADB becomes the control plane: open the app, navigate to the product, dump the UI, parse the screen, and report the result back into the same product data model.

For a shopping or content platform, a useful internal tool usually needs three kinds of information:

stable identifiers, like item IDs and canonical URLs
structured public data, like title, shop, price, sales text, and SKU variants
app-only signals, like badges, tags, ranking hints, recent cart activity, or visible social proof

These do not arrive through one clean API. The design has to accept that reality and make the collection path observable, retryable, and boring enough to operate.

1. Normalize Input Before Doing Anything Expensive

The first boundary is input normalization.

Users paste messy things: short links, long links, copied share text, deep links, or just an item ID. The rest of the system should not care. It should receive one canonical product identity.

function resolveProductInput(input):
  text = trim(input)

  itemId = matchDirectGoodsId(text)
  if itemId exists:
    return { itemId, source: "direct" }

  urls = extractUrls(text)

  for url in urls:
    itemId = extractGoodsId(url)
    if itemId exists:
      return { itemId, source: "url", resolvedUrl: url }

    if url is short link:
      finalUrl = followRedirect(url)
      itemId = extractGoodsId(finalUrl)
      if itemId exists:
        return { itemId, source: "short_link", resolvedUrl: finalUrl }

  raise "no product id found"

This keeps every downstream operation simple. The web collector, app collector, product table, query history, and metric jobs all speak in item IDs.

I like this pattern because it moves chaos to the edge. The user can paste almost anything, but inside the system there is one durable key.
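A minimal Python sketch of the same resolver follows. The `goods_id` query parameter, the six-digit minimum for a bare ID, and the injected `follow_redirect` callback are all assumptions for illustration, not the real platform's URL scheme. Unlike the pseudocode, this sketch follows redirects for any URL that lacks an ID instead of checking for a short-link host first.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import parse_qs, urlparse

URL_RE = re.compile(r"https?://\S+")
# Assumption: a bare item ID is a run of at least six digits.
DIRECT_ID_RE = re.compile(r"\d{6,}")


@dataclass(frozen=True)
class ResolvedProduct:
    item_id: str
    source: str
    resolved_url: Optional[str] = None


def item_id_from_url(url: str) -> Optional[str]:
    """Read the ID from a hypothetical `goods_id` query parameter."""
    values = parse_qs(urlparse(url).query).get("goods_id")
    if values and values[0].isdigit():
        return values[0]
    return None


def resolve_product_input(
    text: str, follow_redirect: Callable[[str], Optional[str]]
) -> ResolvedProduct:
    text = text.strip()
    if DIRECT_ID_RE.fullmatch(text):
        return ResolvedProduct(text, "direct")

    for url in URL_RE.findall(text):
        item_id = item_id_from_url(url)
        if item_id:
            return ResolvedProduct(item_id, "url", url)

        # The network call is injected so the resolver stays testable offline.
        final_url = follow_redirect(url)
        if final_url:
            item_id = item_id_from_url(final_url)
            if item_id:
                return ResolvedProduct(item_id, "short_link", final_url)

    raise ValueError("no product id found")
```

Passing the redirect follower in as a function keeps the only slow, failure-prone step at the boundary, which is the same "chaos at the edge" idea applied to the resolver itself.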

2. Split Web Data From App Detail Data

The second design choice is separating cheap web data from expensive app detail data.

Web data is usually faster and safer:

product title
sales text
shop name
shop URL
location
SKU variants
price fields

App UI collection is slower and more operationally fragile:

it needs a real Android device or emulator
it depends on current UI structure
it can fail because of loading states, login state, app updates, or network conditions
it consumes a scarce device slot

So the default flow should be layered:

function addProduct(input, includeDetail):
  resolved = resolveProductInput(input)
  product = upsertProduct(resolved.itemId)

  product.webStatus = "running"
  product = refreshWebSnapshot(product)

  if includeDetail:
    product.detailStatus = "pending"
    enqueueAppDetailTask(product)

  save(product)
  return product

This gives the user a useful record quickly. App detail collection becomes an optional enrichment step instead of a blocker for the whole product.
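To make the layering concrete, here is a small in-memory sketch. The `Product` fields, the status strings, and the `fetch_web` callback are assumptions for illustration. The point it tries to show is that a web failure is recorded on the product rather than raised, and that the detail step is only ever queued, never awaited.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Product:
    item_id: str
    web_status: str = "pending"
    detail_status: str = "none"
    web_data: Dict[str, str] = field(default_factory=dict)
    web_error: Optional[str] = None


def add_product(
    item_id: str,
    fetch_web: Callable[[str], Dict[str, str]],
    detail_queue: List[str],
    include_detail: bool = False,
) -> Product:
    product = Product(item_id)

    # Cheap layer: runs inline, and a failure becomes visible state.
    try:
        product.web_data = fetch_web(item_id)
        product.web_status = "success"
    except Exception as exc:
        product.web_status = "failed"
        product.web_error = str(exc)

    # Expensive layer: only queued here, so the caller never waits on a phone.
    if include_detail:
        detail_queue.append(item_id)
        product.detail_status = "pending"

    return product
```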

3. Treat Devices as a Pool, Not a Constant

Remote Android collection should not assume one perfect device. The reason for having Android workers at all is that some information is only exposed through the mobile app surface. Once a phone becomes part of the data path, it should be modeled as infrastructure, not as a hidden script dependency.

A better model is a small device pool. Each device has metadata:

name
ADB serial
phone IP
reverse SSH port
app package
online state
busy or idle state
last successful collection time

Then task dispatch can be conservative:

function collectMobileOnlySignals(product):
  device = selectIdleDevice(product.preferredDeviceId)
  if device is null:
    enqueue(product)
    return "pending"

  with adbSession(device.serial):
    openProductInApp(product.itemId)
    xml = dumpUi()
    tags = parseVisibleTags(xml)
    saveEvidence(product, xml, tags, device)

  return "success"

function selectIdleDevice(preferredDeviceId = null):
  statuses = adbDevices()
  busy = queryRunningTaskDeviceIds()
  devices = activeDevicesOrderedById()

  if preferredDeviceId exists:
    device = find(devices, preferredDeviceId)
    if device is online and device.id not in busy:
      return device
    return null

  for device in devices:
    if device.id in busy:
      continue
    if statuses[device.serial] == "device":
      return device

  return null

This matters because the device is not just an implementation detail. It is a scarce worker. If two tasks use the same phone at the same time, the UI state becomes non-deterministic.
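As a rough sketch of the selection logic, the code below parses the text that `adb devices` prints (a header line, then one `serial<TAB>state` line per device) and picks the first online, non-busy serial. The function names and the idea of passing the busy set in as an argument are assumptions. In a real system the busy set would come from the task table.

```python
from typing import Dict, Iterable, Optional, Set


def parse_adb_devices(output: str) -> Dict[str, str]:
    """Map serial -> state ("device", "offline", "unauthorized", ...)."""
    states: Dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header, and adb daemon status lines.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def select_idle_serial(
    serials_in_order: Iterable[str],
    states: Dict[str, str],
    busy: Set[str],
    preferred: Optional[str] = None,
) -> Optional[str]:
    # A preferred device is strict: if it is unavailable the task waits.
    if preferred is not None:
        if states.get(preferred) == "device" and preferred not in busy:
            return preferred
        return None

    for serial in serials_in_order:
        if serial not in busy and states.get(serial) == "device":
            return serial
    return None
```

Only the state `device` counts as usable here. `offline` and `unauthorized` phones show up in the listing but cannot run commands, so they are treated the same as missing ones.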

4. Make App Collection a Queue

Once devices are treated as workers, app detail collection naturally becomes a queue.

The task record should store:

item ID
input and canonical URL
status: pending, running, success, failed
assigned device
API key or user that requested it
started and finished timestamps
extracted tags
raw UI nodes
error message
elapsed time

The dispatcher can stay simple:

function dispatchPendingTasks():
  tasks = oldestPendingTasks(limit = 10)

  for task in tasks:
    device = selectIdleDevice(task.preferredDeviceId)
    if device is null:
      continue

    task.status = "running"
    task.deviceId = device.id
    task.startedAt = now()
    save(task)

    # run in the background so one slow collection does not block the loop
    runTaskInBackground(task.id)

function runTask(taskId):
  task = loadTask(taskId)
  device = loadDevice(task.deviceId)

  try:
    result = collectFromApp(task.itemId, device.serial)
    task.status = "success"
    task.tags = result.tags
    task.rawItems = result.items
    task.elapsedMs = result.elapsedMs
    task.finishedAt = now()
    markProductDetailSuccess(task.productId)
    saveMetrics(loadProduct(task.productId), result.tags)
    markDeviceSeen(device)
  catch error:
    task.status = "failed"
    task.error = error.message
    task.finishedAt = now()
    markProductDetailFailed(task.productId)

  save(task)
  dispatchPendingTasks()

The point is not to build a large job system too early. The point is to make the state explicit enough that failures are visible and retryable.
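One detail worth sketching is the claim step. If two dispatchers run at once, both can read the same device as idle. A conditional `UPDATE` makes the claim atomic: it only succeeds if the task is still pending and no running task already holds the device. The table layout below is a hypothetical simplification using Python's built-in `sqlite3`, not the actual schema.

```python
import sqlite3


def open_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE tasks (
            id INTEGER PRIMARY KEY,
            item_id TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',
            device_id TEXT,
            started_at TEXT
        )
        """
    )
    return conn


def claim_task(conn: sqlite3.Connection, task_id: int, device_id: str) -> bool:
    """Move a task to 'running' on a device; return False if the claim lost."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            """
            UPDATE tasks
               SET status = 'running',
                   device_id = ?,
                   started_at = datetime('now')
             WHERE id = ?
               AND status = 'pending'
               AND NOT EXISTS (
                     SELECT 1 FROM tasks AS t
                      WHERE t.device_id = ? AND t.status = 'running'
                   )
            """,
            (device_id, task_id, device_id),
        )
    # Exactly one updated row means this caller owns the task and the device.
    return cur.rowcount == 1
```

The dispatcher would call `claim_task` and only start the collection when it returns `True`. A lost claim is not an error; the task simply stays pending for the next pass.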

5. Parse the App Screen as Evidence

App UI collection is different from API collection. The screen is the data source.

A practical approach is:

open the product deep link
wait for the screen to settle
dump the UI XML
scan text and content descriptions
filter likely tag strings
use screen bounds to ignore irrelevant regions
swipe horizontally when tags overflow
deduplicate while preserving order

Pseudocode:

function collectFromApp(itemId, serial):
  startedAt = now()
  openDeepLink(serial, "app://goods_detail/" + itemId)
  waitForSettledScreen(serial)

  screen = getScreenSize(serial)
  allItems = []
  seen = set()

  for round in 0..maxSwipes:
    xml = dumpUiXml(serial)

    if pageShowsLoadError(xml):
      raise "product page load failed"

    items = extractCandidateTags(xml, screen.height)
    newCount = 0

    for item in items:
      if item.text not in seen:
        seen.add(item.text)
        allItems.append(item)
        newCount = newCount + 1

    # stop when a swipe reveals nothing new, or nothing was found at all
    if newCount == 0:
      break

    swipeTagArea(serial, y = medianY(items))
    waitForSettledScreen(serial)

  # allItems is already unique by text and in first-seen order
  return { tags: textsOf(allItems), items: allItems, elapsedMs: now() - startedAt }

This is intentionally heuristic. UI extraction is not a clean contract. The best design is to preserve enough raw evidence to debug bad matches later.

That is why storing raw UI node information is useful. It turns “the collector missed something” into a problem that can be inspected.
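For the extraction step, a sketch using only the standard library might look like this. It assumes the dump has the shape produced by Android's uiautomator: `node` elements with `text`, `content-desc`, and `bounds="[x1,y1][x2,y2]"` attributes. The length cap and the vertical band used to ignore irrelevant regions are made-up heuristics that would need tuning against real screens.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

BOUNDS_RE = re.compile(r"\[(-?\d+),(-?\d+)\]\[(-?\d+),(-?\d+)\]")


@dataclass(frozen=True)
class UiItem:
    text: str
    center_y: int


def extract_candidate_tags(
    xml_text: str, screen_height: int, max_len: int = 20
) -> List[UiItem]:
    items: List[UiItem] = []
    seen = set()

    for node in ET.fromstring(xml_text).iter("node"):
        # Prefer visible text, fall back to the accessibility description.
        text = (node.get("text") or node.get("content-desc") or "").strip()
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if not text or len(text) > max_len or match is None:
            continue

        center_y = (int(match.group(2)) + int(match.group(4))) // 2
        # Assumption: tags sit in the middle band, away from status and tab bars.
        if not 0.2 * screen_height <= center_y <= 0.8 * screen_height:
            continue

        # Deduplicate while preserving first-seen order.
        if text in seen:
            continue
        seen.add(text)
        items.append(UiItem(text, center_y))

    return items
```

Keeping `center_y` on each item is what lets the swipe step aim at the row where the tags actually are, and it is also cheap evidence to store next to the raw XML.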

6. Do Not Store Only Final Tags

Final tags are convenient, but they are not enough.

For analysis, I want three layers:

the product record
the query or collection history
extracted metrics over time

The product record answers “what is this item now?” The query history answers “what happened during collection?” The metric table answers “how did a signal change?”

For example, a tag like 24小时内123人加购 (“123 people added this to cart in the last 24 hours”) is both display text and structured data. It should be saved as raw text, but it can also become a metric:

patterns = [
  ("24h_cart", /24小时内(\d+)\+?人加购/),
  ("7d_fav", /近7天新增(\d+)\+?人收藏/)
]

function saveMetrics(product, tags):
  for tag in tags:
    for dimension, pattern in patterns:
      if pattern matches tag:
        insertMetric(product.id, product.itemId, dimension, capturedNumber)

This makes the tool more than a scraper. It becomes a small time-series surface for product signals.
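The same two patterns translate directly to Python. The sketch below returns rows instead of writing to a table, and the row shape `(dimension, value, raw_tag)` is an assumption. The second pattern reads "N people newly favorited in the last 7 days".

```python
import re
from typing import Iterable, List, Tuple

PATTERNS = [
    ("24h_cart", re.compile(r"24小时内(\d+)\+?人加购")),
    ("7d_fav", re.compile(r"近7天新增(\d+)\+?人收藏")),
]


def extract_metrics(tags: Iterable[str]) -> List[Tuple[str, int, str]]:
    """Return (dimension, value, raw_tag) for every tag that matches a pattern."""
    rows: List[Tuple[str, int, str]] = []
    for tag in tags:
        for dimension, pattern in PATTERNS:
            match = pattern.search(tag)
            if match:
                # Keep the raw tag next to the number so the value stays auditable.
                rows.append((dimension, int(match.group(1)), tag))
    return rows
```

Note that a trailing `+` (as in `500+`) means "at least", and the number alone loses that. Storing the raw tag alongside the value keeps the distinction recoverable.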

7. Keep the Console Operational

The frontend should feel like an operations console, not a marketing dashboard.

The important screens are practical:

product list with filters
detail enrichment status
latest query result
device status and busy state
API key management
marketplace collection progress
metric summary and history

The UI should show state instead of hiding it. The states pending, running, success, and failed are not backend details. They are the product.

When collection depends on a remote phone, silence is expensive. A user needs to know whether the device is offline, busy, waiting, collecting, or failed.

8. Keep Secrets and Permissions Narrow

Even a small internal tool needs a security shape.

The simple version is enough:

raw API keys are shown once
only SHA-256 hashes are stored
keys can expire
keys can be revoked
device management and key management are separate permissions

Pseudocode:

function createKey(name, expiresAt, permissions):
  raw = "sk-" + secureRandom()

  record = {
    name,
    keyPrefix: firstVisiblePart(raw),
    keyHash: sha256(raw),
    expiresAt,
    permissions,
    status: "active"
  }

  save(record)
  return raw

function verifyKey(raw):
  if raw does not start with "sk-":
    return null

  record = findByHash(sha256(raw))

  if record is missing or revoked or expired:
    return null

  record.lastUsedAt = now()
  save(record)
  return record

This keeps operational access simple without storing reusable secrets in plain text.
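Here is a minimal sketch of that key lifecycle using `secrets` and `hashlib` from the standard library. The in-memory dict standing in for the key table, the `KeyRecord` fields, and the seven-character prefix are assumptions. A plain SHA-256 is reasonable here only because the raw key is long and random; it would not be a good choice for user-chosen passwords.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class KeyRecord:
    name: str
    key_prefix: str
    key_hash: str
    expires_at: Optional[datetime] = None
    status: str = "active"
    last_used_at: Optional[datetime] = None


def hash_key(raw: str) -> str:
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def create_key(
    store: Dict[str, KeyRecord], name: str, expires_at: Optional[datetime] = None
) -> str:
    raw = "sk-" + secrets.token_urlsafe(32)
    digest = hash_key(raw)
    # Only the hash and a short display prefix are kept; the raw key is not.
    store[digest] = KeyRecord(name, raw[:7], digest, expires_at)
    return raw  # shown to the caller once


def verify_key(
    store: Dict[str, KeyRecord], raw: str, now: Optional[datetime] = None
) -> Optional[KeyRecord]:
    if not raw.startswith("sk-"):
        return None
    record = store.get(hash_key(raw))
    now = now or datetime.now(timezone.utc)
    if record is None or record.status != "active":
        return None
    if record.expires_at is not None and record.expires_at <= now:
        return None
    record.last_used_at = now
    return record
```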

Current Rule

For app-surface data collection, my rule is:

normalize early, collect in layers, store evidence, extract metrics late

The useful system is not the one that perfectly scrapes every screen once. It is the one that can explain what it tried, what it found, which worker did it, why it failed, and which signals are worth tracking over time.